cgroup drivers
Jul 27, 2023 00:00 · 2647 words · 6 minute read
cgroup (control group) is a Linux kernel feature that limits and isolates the resource usage (CPU, memory, disk I/O, and network) of processes. The cgroup interface is exposed to its users, kubelet and the container runtime, through a cgroup driver.
We often see cgroup-related settings in the configuration files of kubelet and the container runtime (containerd):
- kubelet (default /var/lib/kubelet/config.yaml)

  apiVersion: kubelet.config.k8s.io/v1beta1
  cgroupDriver: systemd
- containerd (default /etc/containerd/config.toml)

  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
  ...
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
    SystemdCgroup = true
There are two cgroup drivers:
- cgroupfs
- systemd (the kubelet default)
Both drivers share the same parent directory, /sys/fs/cgroup.
systemd
Create an nginx Deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.14.2
        ports:
        - containerPort: 80
        resources:
          requests:
            memory: "64Mi"
            cpu: "250m"
          limits:
            memory: "128Mi"
            cpu: "500m"
and find the Pod UID:
$ kubectl get po nginx-deployment-67c946cfbb-9m6db -o jsonpath='{.metadata.uid}'
5cbfaa6d-d3c3-4df6-ab6b-3d05fd998b9f
CPU
Assemble the path according to the naming rules and check the CPU quota:
# /sys/fs/cgroup/cpu/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod${uuid}.slice ("-" in the UID is replaced with "_")
$ ll /sys/fs/cgroup/cpu/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod5cbfaa6d_d3c3_4df6_ab6b_3d05fd998b9f.slice
total 0
-rw-r--r-- 1 root root 0 Jul 26 22:16 cgroup.clone_children
-rw-r--r-- 1 root root 0 Jul 26 22:16 cgroup.procs
-r--r--r-- 1 root root 0 Jul 26 22:16 cpuacct.stat
-rw-r--r-- 1 root root 0 Jul 26 22:16 cpuacct.usage
-r--r--r-- 1 root root 0 Jul 26 22:16 cpuacct.usage_all
-r--r--r-- 1 root root 0 Jul 26 22:16 cpuacct.usage_percpu
-r--r--r-- 1 root root 0 Jul 26 22:16 cpuacct.usage_percpu_sys
-r--r--r-- 1 root root 0 Jul 26 22:16 cpuacct.usage_percpu_user
-r--r--r-- 1 root root 0 Jul 26 22:16 cpuacct.usage_sys
-r--r--r-- 1 root root 0 Jul 26 22:16 cpuacct.usage_user
-rw-r--r-- 1 root root 0 Jul 26 22:16 cpu.cfs_period_us
-rw-r--r-- 1 root root 0 Jul 26 22:16 cpu.cfs_quota_us
-rw-r--r-- 1 root root 0 Jul 26 22:16 cpu.rt_period_us
-rw-r--r-- 1 root root 0 Jul 26 22:16 cpu.rt_runtime_us
-rw-r--r-- 1 root root 0 Jul 26 22:16 cpu.shares
-r--r--r-- 1 root root 0 Jul 26 22:16 cpu.stat
drwxr-xr-x 2 root root 0 Jul 26 22:16 cri-containerd-064f208642012e62cb45eda41d6113ee6107ed429a44e36f1eaf86407135ab3f.scope
drwxr-xr-x 2 root root 0 Jul 26 22:16 cri-containerd-4c851f314c196ce2258b96b9ada7cd1dd7dfa20475d35fcd644198f28d03389d.scope
-rw-r--r-- 1 root root 0 Jul 26 22:16 notify_on_release
-rw-r--r-- 1 root root 0 Jul 26 22:16 tasks
$ cat /sys/fs/cgroup/cpu/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod5cbfaa6d_d3c3_4df6_ab6b_3d05fd998b9f.slice/cpu.cfs_quota_us
50000
This is the 500m we set in the Pod's resources.limits.cpu: with the default cpu.cfs_period_us of 100000, a quota of 50000 means 0.5 CPU.
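As a sanity check, the quota can be derived from the CPU limit with the standard CFS formula. A minimal hand-written sketch (the helper name milliCPUToQuota is ours for illustration, not kubelet's exact code):

package main

import "fmt"

const cfsPeriodUs = 100000 // default cpu.cfs_period_us (100ms)

// milliCPUToQuota converts a Kubernetes CPU limit in millicores to
// cpu.cfs_quota_us: quota = milliCPU * period / 1000.
func milliCPUToQuota(milliCPU int64) int64 {
    if milliCPU == 0 {
        return 0
    }
    return milliCPU * cfsPeriodUs / 1000
}

func main() {
    fmt.Println(milliCPUToQuota(500)) // 50000, matching cpu.cfs_quota_us above
}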
Memory
Assemble the path the same way and check the memory limit:
# /sys/fs/cgroup/memory/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod${uuid}.slice ("-" in the UID is replaced with "_")
$ ll /sys/fs/cgroup/memory/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod5cbfaa6d_d3c3_4df6_ab6b_3d05fd998b9f.slice
total 0
-rw-r--r-- 1 root root 0 Jul 26 22:16 cgroup.clone_children
--w--w--w- 1 root root 0 Jul 26 22:16 cgroup.event_control
-rw-r--r-- 1 root root 0 Jul 26 22:16 cgroup.procs
drwxr-xr-x 2 root root 0 Jul 26 22:16 cri-containerd-064f208642012e62cb45eda41d6113ee6107ed429a44e36f1eaf86407135ab3f.scope
drwxr-xr-x 2 root root 0 Jul 26 22:16 cri-containerd-4c851f314c196ce2258b96b9ada7cd1dd7dfa20475d35fcd644198f28d03389d.scope
-rw-r--r-- 1 root root 0 Jul 26 22:16 memory.failcnt
--w------- 1 root root 0 Jul 26 22:16 memory.force_empty
-rw-r--r-- 1 root root 0 Jul 26 22:16 memory.kmem.failcnt
-rw-r--r-- 1 root root 0 Jul 26 22:16 memory.kmem.limit_in_bytes
-rw-r--r-- 1 root root 0 Jul 26 22:16 memory.kmem.max_usage_in_bytes
-r--r--r-- 1 root root 0 Jul 26 22:16 memory.kmem.slabinfo
-rw-r--r-- 1 root root 0 Jul 26 22:16 memory.kmem.tcp.failcnt
-rw-r--r-- 1 root root 0 Jul 26 22:16 memory.kmem.tcp.limit_in_bytes
-rw-r--r-- 1 root root 0 Jul 26 22:16 memory.kmem.tcp.max_usage_in_bytes
-r--r--r-- 1 root root 0 Jul 26 22:16 memory.kmem.tcp.usage_in_bytes
-r--r--r-- 1 root root 0 Jul 26 22:16 memory.kmem.usage_in_bytes
-rw-r--r-- 1 root root 0 Jul 26 22:16 memory.limit_in_bytes
-rw-r--r-- 1 root root 0 Jul 26 22:16 memory.max_usage_in_bytes
-rw-r--r-- 1 root root 0 Jul 26 22:16 memory.memsw.failcnt
-rw-r--r-- 1 root root 0 Jul 26 22:16 memory.memsw.limit_in_bytes
-rw-r--r-- 1 root root 0 Jul 26 22:16 memory.memsw.max_usage_in_bytes
-r--r--r-- 1 root root 0 Jul 26 22:16 memory.memsw.usage_in_bytes
-rw-r--r-- 1 root root 0 Jul 26 22:16 memory.move_charge_at_immigrate
-r--r--r-- 1 root root 0 Jul 26 22:16 memory.numa_stat
-rw-r--r-- 1 root root 0 Jul 26 22:16 memory.oom_control
---------- 1 root root 0 Jul 26 22:16 memory.pressure_level
-rw-r--r-- 1 root root 0 Jul 26 22:16 memory.soft_limit_in_bytes
-r--r--r-- 1 root root 0 Jul 26 22:16 memory.stat
-rw-r--r-- 1 root root 0 Jul 26 22:16 memory.swappiness
-r--r--r-- 1 root root 0 Jul 26 22:16 memory.usage_in_bytes
-rw-r--r-- 1 root root 0 Jul 26 22:16 memory.use_hierarchy
-rw-r--r-- 1 root root 0 Jul 26 22:16 notify_on_release
-rw-r--r-- 1 root root 0 Jul 26 22:16 tasks
$ cat /sys/fs/cgroup/memory/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod5cbfaa6d_d3c3_4df6_ab6b_3d05fd998b9f.slice/memory.limit_in_bytes
134217728 # 128MiB
This is the 128Mi we set in the Pod's resources.limits.memory (128 * 1024 * 1024 = 134217728 bytes).
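The byte value can also be verified programmatically with the apimachinery resource package; a small sketch, assuming k8s.io/apimachinery is available as a dependency:

package main

import (
    "fmt"

    "k8s.io/apimachinery/pkg/api/resource"
)

func main() {
    // Parse the same quantity we put in resources.limits.memory.
    q := resource.MustParse("128Mi")
    fmt.Println(q.Value()) // 134217728, matching memory.limit_in_bytes above
}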
cgroupfs
Create the same nginx Deployment and find the Pod UID:
$ kubectl get po nginx-deployment-6f96cddcf9-5c9ss -o jsonpath='{.metadata.uid}'
2409d111-0100-4da7-94c8-b63c6b659d2e
CPU
Assemble the path according to the cgroupfs naming rules and check the CPU quota:
# /sys/fs/cgroup/cpu/kubepods/burstable/pod${uuid}/cpu.cfs_quota_us
$ cat /sys/fs/cgroup/cpu/kubepods/burstable/pod2409d111-0100-4da7-94c8-b63c6b659d2e/cpu.cfs_quota_us
50000
Again, this is the 500m we set in the Pod's resources.limits.cpu.
Memory
Assemble the path the same way and check the memory limit:
# /sys/fs/cgroup/memory/kubepods/burstable/pod${uuid}/memory.limit_in_bytes
$ cat /sys/fs/cgroup/memory/kubepods/burstable/pod2409d111-0100-4da7-94c8-b63c6b659d2e/memory.limit_in_bytes
134217728 # 128Mi
Again, this is the 128Mi we set in the Pod's resources.limits.memory.
As you can see, the two cgroup drivers differ in the sub-paths they create under /sys/fs/cgroup.
kubelet
Let's look at how kubelet handles the two cgroup drivers: https://github.com/kubernetes/kubernetes/blob/e31aafc4fdaa70e3e14b9402efef7bd8d153c0e5/pkg/kubelet/cm/container_manager_linux.go#L266
func NewContainerManager(mountUtil mount.Interface, cadvisorInterface cadvisor.Interface, nodeConfig NodeConfig, failSwapOn bool, devicePluginEnabled bool, recorder record.EventRecorder) (ContainerManager, error) {
    // a lot of code here
    cgroupManager := NewCgroupManager(subsystems, nodeConfig.CgroupDriver)
}
const (
    // libcontainerCgroupfs means use libcontainer with cgroupfs
    libcontainerCgroupfs libcontainerCgroupManagerType = "cgroupfs"
    // libcontainerSystemd means use libcontainer with systemd
    libcontainerSystemd libcontainerCgroupManagerType = "systemd"
    // systemdSuffix is the cgroup name suffix for systemd
    systemdSuffix string = ".slice"
)
func NewCgroupManager(cs *CgroupSubsystems, cgroupDriver string) CgroupManager {
    managerType := libcontainerCgroupfs
    if cgroupDriver == string(libcontainerSystemd) { // systemd
        managerType = libcontainerSystemd
    }
    return &cgroupManagerImpl{
        subsystems: cs,
        adapter:    newLibcontainerAdapter(managerType),
    }
}
Currently kubelet supports only these two cgroup drivers: cgroupfs and systemd.
The Linux cgroup manager implementation lives in https://github.com/kubernetes/kubernetes/blob/cd6ffff85d257ff9067d59339f2ffdbcc66dc164/pkg/kubelet/cm/cgroup_manager_linux.go.
The Name(name CgroupName) method builds the driver-specific cgroup sub-path: https://github.com/kubernetes/kubernetes/blob/cd6ffff85d257ff9067d59339f2ffdbcc66dc164/pkg/kubelet/cm/cgroup_manager_linux.go#L207-L214
func (m *cgroupManagerImpl) Name(name CgroupName) string {
    if m.adapter.cgroupManagerType == libcontainerSystemd {
        return name.ToSystemd()
    }
    return name.ToCgroupfs()
}
Suppose the CgroupName string slice is {"kubepods", "burstable", "pod1234-abcd-5678-efgh"}:
- systemd: name.ToSystemd()

func (cgroupName CgroupName) ToSystemd() string {
    if len(cgroupName) == 0 || (len(cgroupName) == 1 && cgroupName[0] == "") {
        return "/"
    }
    newparts := []string{}
    for _, part := range cgroupName {
        part = escapeSystemdCgroupName(part)
        newparts = append(newparts, part)
    }
    result, err := cgroupsystemd.ExpandSlice(strings.Join(newparts, "-") + systemdSuffix)
    if err != nil {
        // Should never happen...
        panic(fmt.Errorf("error converting cgroup name [%v] to systemd format: %v", cgroupName, err))
    }
    return result
}
The systemd sub-path construction is more involved: the CgroupName above is converted to /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod1234_abcd_5678_efgh.slice. Note that the "-" characters in the Pod UID are replaced with "_", which is exactly the format seen in the systemd example above.
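For reference, the slice expansion done by cgroupsystemd.ExpandSlice behaves roughly as below. This is a simplified hand-written sketch (expandSlice is our own illustrative function and ignores the escaping and validation the real implementation performs):

package main

import (
    "fmt"
    "strings"
)

// expandSlice turns "kubepods-burstable-podX.slice" into
// "/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podX.slice":
// every "-" adds one more level of nesting, and each prefix gets its own
// ".slice" suffix.
func expandSlice(slice string) string {
    name := strings.TrimSuffix(slice, ".slice")
    path := ""
    prefix := ""
    for _, part := range strings.Split(name, "-") {
        if prefix != "" {
            prefix += "-"
        }
        prefix += part
        path += "/" + prefix + ".slice"
    }
    return path
}

func main() {
    fmt.Println(expandSlice("kubepods-burstable-pod1234_abcd_5678_efgh.slice"))
    // /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod1234_abcd_5678_efgh.slice
}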
- cgroupfs: name.ToCgroupfs()

func (cgroupName CgroupName) ToCgroupfs() string {
    return "/" + path.Join(cgroupName...)
}
The cgroupfs sub-path construction is much simpler: the same CgroupName becomes /kubepods/burstable/pod1234-abcd-5678-efgh, the format seen in the cgroupfs example above.
Next, look at the Create(cgroupConfig *CgroupConfig) method, which creates the cgroup:
type CgroupConfig struct {
    // Fully qualified name prior to any driver specific conversions.
    Name CgroupName
    // ResourceParameters contains various cgroups settings to apply.
    ResourceParameters *ResourceConfig
}
func (m *cgroupManagerImpl) Create(cgroupConfig *CgroupConfig) error {
    // a lot of code here
    if m.adapter.cgroupManagerType == libcontainerSystemd {
        updateSystemdCgroupInfo(libcontainerCgroupConfig, cgroupConfig.Name)
    } else {
        libcontainerCgroupConfig.Path = cgroupConfig.Name.ToCgroupfs()
    }

    // get the manager with the specified cgroup configuration
    manager, err := m.adapter.newManager(libcontainerCgroupConfig, nil)
    if err != nil {
        return err
    }

    // Apply(-1) is a hack to create the cgroup directories for each resource
    // subsystem. The function [cgroups.Manager.apply()] applies cgroup
    // configuration to the process with the specified pid.
    // It creates cgroup files for each subsystems and writes the pid
    // in the tasks file. We use the function to create all the required
    // cgroup files but not attach any "real" pid to the cgroup.
    if err := manager.Apply(-1); err != nil {
        return err
    }

    // it may confuse why we call set after we do apply, but the issue is that runc
    // follows a similar pattern. it's needed to ensure cpu quota is set properly.
    if err := m.Update(cgroupConfig); err != nil {
        utilruntime.HandleError(fmt.Errorf("cgroup update failed %v", err))
    }

    return nil
}
func (l *libcontainerAdapter) newManager(cgroups *libcontainerconfigs.Cgroup, paths map[string]string) (libcontainercgroups.Manager, error) {
    switch l.cgroupManagerType {
    case libcontainerCgroupfs:
        if libcontainercgroups.IsCgroup2UnifiedMode() {
            return cgroupfs2.NewManager(cgroups, paths["memory"], false)
        }
        return cgroupfs.NewManager(cgroups, paths, false), nil
    case libcontainerSystemd:
        // this means you asked systemd to manage cgroups, but systemd was not on the host, so all you can do is panic...
        if !cgroupsystemd.IsRunningSystemd() {
            panic("systemd cgroup manager not available")
        }
        if libcontainercgroups.IsCgroup2UnifiedMode() {
            return cgroupsystemd.NewUnifiedManager(cgroups, paths["memory"], false), nil
        }
        return cgroupsystemd.NewLegacyManager(cgroups, paths), nil
    }
    return nil, fmt.Errorf("invalid cgroup manager configuration")
}
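The IsCgroup2UnifiedMode branches above decide between cgroup v1 and v2 managers by checking what is mounted at /sys/fs/cgroup. A minimal sketch of that kind of check, assuming golang.org/x/sys/unix is available (the real libcontainer helper caches the result and handles more edge cases):

package main

import (
    "fmt"

    "golang.org/x/sys/unix"
)

// isCgroup2UnifiedMode reports whether /sys/fs/cgroup is a cgroup v2
// (unified) mount by comparing the filesystem magic number.
func isCgroup2UnifiedMode() (bool, error) {
    var st unix.Statfs_t
    if err := unix.Statfs("/sys/fs/cgroup", &st); err != nil {
        return false, err
    }
    return st.Type == unix.CGROUP2_SUPER_MAGIC, nil
}

func main() {
    unified, err := isCgroup2UnifiedMode()
    if err != nil {
        panic(err)
    }
    fmt.Println("cgroup v2 unified mode:", unified)
}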
The cgroupConfig parameter contains the CgroupName plus the values that limit the various compute resources.
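For the nginx Pod above, the configuration passed in looks roughly like the following. This is a hand-written fragment for illustration only, reusing the CgroupConfig and ResourceConfig types from the kubelet cm package; the concrete values are assumptions derived from the Pod's requests and limits, not captured from a running kubelet:

// Illustrative values for the nginx Pod above (assumptions, not kubelet output).
memory := int64(128 * 1024 * 1024) // limits.memory 128Mi -> memory.limit_in_bytes
cpuShares := uint64(256)           // requests.cpu 250m  -> cpu.shares (250 * 1024 / 1000)
cpuQuota := int64(50000)           // limits.cpu 500m    -> cpu.cfs_quota_us
cpuPeriod := uint64(100000)        // default cpu.cfs_period_us

cgroupConfig := &CgroupConfig{
    Name: CgroupName{"kubepods", "burstable", "pod5cbfaa6d-d3c3-4df6-ab6b-3d05fd998b9f"},
    ResourceParameters: &ResourceConfig{
        Memory:    &memory,
        CPUShares: &cpuShares,
        CPUQuota:  &cpuQuota,
        CPUPeriod: &cpuPeriod,
    },
}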
- systemd: the updateSystemdCgroupInfo function, which fills in the libcontainer Cgroup's Parent and Name from the systemd path

func updateSystemdCgroupInfo(cgroupConfig *libcontainerconfigs.Cgroup, cgroupName CgroupName) {
    dir, base := path.Split(cgroupName.ToSystemd())
    if dir == "/" {
        dir = "-.slice"
    } else {
        dir = path.Base(dir)
    }
    cgroupConfig.Parent = dir
    cgroupConfig.Name = base
}
- cgroupfs: simply libcontainerCgroupConfig.Path = cgroupConfig.Name.ToCgroupfs()
Finally, the Create method calls the Update method to apply the cgroup settings:
func (m *cgroupManagerImpl) Update(cgroupConfig *CgroupConfig) error {
    // a lot of code here
    manager, err := m.adapter.newManager(libcontainerCgroupConfig, paths)
    if err != nil {
        return fmt.Errorf("failed to create cgroup manager: %v", err)
    }
    return manager.Set(resources)
}
- The systemd implementation:

Linux cgroups come in two versions (v1 and v2). /proc/cgroups can tell you which one is in use:

$ cat /proc/cgroups | grep hierarchy
#subsys_name hierarchy num_cgroups enabled

(The grep only matches the header line, which shows the column names.) On cgroup v1 the hierarchy column holds non-zero hierarchy IDs for the mounted controllers, while on cgroup v2 it is 0 for every controller. Once the version is known, look at the matching systemd implementation; on cgroup v1 that is the legacyManager:
func (m *legacyManager) Set(r *configs.Resources) error {
    // If Paths are set, then we are just joining cgroups paths
    // and there is no need to set any values.
    if m.cgroups.Paths != nil {
        return nil
    }
    if r.Unified != nil {
        return cgroups.ErrV1NoUnified
    }

    properties, err := genV1ResourcesProperties(r, m.dbus)
    if err != nil {
        return err
    }

    unitName := getUnitName(m.cgroups)
    needsFreeze, needsThaw, err := m.freezeBeforeSet(unitName, r)
    if err != nil {
        return err
    }

    if needsFreeze {
        if err := m.doFreeze(configs.Frozen); err != nil {
            // If freezer cgroup isn't supported, we just warn about it.
            logrus.Infof("freeze container before SetUnitProperties failed: %v", err)
        }
    }
    setErr := setUnitProperties(m.dbus, unitName, properties...)
    if needsThaw {
        if err := m.doFreeze(configs.Thawed); err != nil {
            logrus.Infof("thaw container after SetUnitProperties failed: %v", err)
        }
    }
    if setErr != nil {
        return setErr
    }

    for _, sys := range legacySubsystems {
        // Get the subsystem path, but don't error out for not found cgroups.
        path, ok := m.paths[sys.Name()]
        if !ok {
            continue
        }
        if err := sys.Set(path, r); err != nil {
            return err
        }
    }

    return nil
}
It calls the systemd API to set up cgroups for the container processes.
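Under the hood, setUnitProperties talks to systemd over D-Bus. The following is a minimal hand-written sketch of such a call using the go-systemd library; the unit name and the MemoryLimit / CPUQuotaPerSecUSec property values are illustrative assumptions, not values captured from kubelet or runc:

package main

import (
    systemddbus "github.com/coreos/go-systemd/v22/dbus"
    godbus "github.com/godbus/dbus/v5"
)

func main() {
    conn, err := systemddbus.New() // connect to the systemd D-Bus API
    if err != nil {
        panic(err)
    }
    defer conn.Close()

    // Set resource properties on the pod's slice unit (runtime=true means
    // the change is not persisted across reboots).
    unit := "kubepods-burstable-pod5cbfaa6d_d3c3_4df6_ab6b_3d05fd998b9f.slice"
    props := []systemddbus.Property{
        {Name: "MemoryLimit", Value: godbus.MakeVariant(uint64(134217728))},      // 128Mi
        {Name: "CPUQuotaPerSecUSec", Value: godbus.MakeVariant(uint64(500000))},  // 500m CPU
    }
    if err := conn.SetUnitProperties(unit, true, props...); err != nil {
        panic(err)
    }
}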
- The cgroupfs implementation:

func (m *manager) Set(r *configs.Resources) error {
    if r == nil {
        return nil
    }

    // If Paths are set, then we are just joining cgroups paths
    // and there is no need to set any values.
    if m.cgroups != nil && m.cgroups.Paths != nil {
        return nil
    }
    if r.Unified != nil {
        return cgroups.ErrV1NoUnified
    }

    m.mu.Lock()
    defer m.mu.Unlock()
    for _, sys := range subsystems {
        path := m.paths[sys.Name()]
        if err := sys.Set(path, r); err != nil {
            if m.rootless && sys.Name() == "devices" {
                continue
            }
            // When m.rootless is true, errors from the device subsystem are ignored because it is really not expected to work.
            // However, errors from other subsystems are not ignored.
            // see @test "runc create (rootless + limits + no cgrouppath + no permission) fails with informative error"
            if path == "" {
                // We never created a path for this cgroup, so we cannot set
                // limits for it (though we have already tried at this point).
                return fmt.Errorf("cannot set %s limit: container could not join or create cgroup", sys.Name())
            }
            return err
        }
    }
    return nil
}
It interacts with the cgroup filesystem directly to set up cgroups for the container processes.
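At the lowest level, each subsystem's Set comes down to writing values into the corresponding cgroup files. A minimal hand-written sketch (not the runc code), reusing the pod path from the cgroupfs example above:

package main

import (
    "fmt"
    "os"
    "path/filepath"
)

func main() {
    // Pod cgroup directory from the cgroupfs example above.
    dir := "/sys/fs/cgroup/memory/kubepods/burstable/pod2409d111-0100-4da7-94c8-b63c6b659d2e"

    // Setting the memory limit is just a write to memory.limit_in_bytes.
    limit := []byte("134217728") // 128Mi
    if err := os.WriteFile(filepath.Join(dir, "memory.limit_in_bytes"), limit, 0644); err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
}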
Summary
The two cgroup drivers correspond to different rules for building sub-paths. With the cgroupfs driver, kubelet and the container runtime configure cgroups by reading and writing the cgroup filesystem directly.
However, the cgroupfs driver is not recommended when systemd is the init system: systemd is itself a cgroup manager, and having a second cgroup manager on the system is undesirable. The two managers end up with different views of the resources in use, which in some cases leads to unstable processes on the node.