cgroup drivers

Jul 27, 2023 00:00 · 2647 words · 6 minute read · Linux, Container, Kubernetes

cgroup (control group) is a Linux kernel feature that limits and isolates the resource usage (CPU, memory, disk I/O, and network) of processes. The cgroup interface is exposed to its users (the kubelet and the container runtime) through a cgroup driver.

We often see cgroup-related settings in the configuration files of the kubelet and the container runtime (containerd); note that both must be configured to use the same cgroup driver:

  • kubelet (default: /var/lib/kubelet/config.yaml)

    apiVersion: kubelet.config.k8s.io/v1beta1
    cgroupDriver: systemd
    
  • containerd (default: /etc/containerd/config.toml)

    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
    ...
    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
        SystemdCgroup = true
    

There are two kinds of cgroup driver:

  1. cgroupfs
  2. systemd (the default for kubeadm-based clusters since v1.22)

Both drivers live under the same parent directory, /sys/fs/cgroup.

systemd

Create an nginx Deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.14.2
        ports:
        - containerPort: 80
        resources:
          requests:
            memory: "64Mi"
            cpu: "250m"
          limits:
            memory: "128Mi"
            cpu: "500m"

and find the Pod's UID:

$ kubectl get po nginx-deployment-67c946cfbb-9m6db -o jsonpath='{.metadata.uid}'
5cbfaa6d-d3c3-4df6-ab6b-3d05fd998b9f

CPU

Assemble the path according to the naming rule and check the CPU quota:

# /sys/fs/cgroup/cpu/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod${uid}.slice ("-" in the UID replaced with "_")
$ ll /sys/fs/cgroup/cpu/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod5cbfaa6d_d3c3_4df6_ab6b_3d05fd998b9f.slice
total 0
-rw-r--r-- 1 root root 0 Jul 26 22:16 cgroup.clone_children
-rw-r--r-- 1 root root 0 Jul 26 22:16 cgroup.procs
-r--r--r-- 1 root root 0 Jul 26 22:16 cpuacct.stat
-rw-r--r-- 1 root root 0 Jul 26 22:16 cpuacct.usage
-r--r--r-- 1 root root 0 Jul 26 22:16 cpuacct.usage_all
-r--r--r-- 1 root root 0 Jul 26 22:16 cpuacct.usage_percpu
-r--r--r-- 1 root root 0 Jul 26 22:16 cpuacct.usage_percpu_sys
-r--r--r-- 1 root root 0 Jul 26 22:16 cpuacct.usage_percpu_user
-r--r--r-- 1 root root 0 Jul 26 22:16 cpuacct.usage_sys
-r--r--r-- 1 root root 0 Jul 26 22:16 cpuacct.usage_user
-rw-r--r-- 1 root root 0 Jul 26 22:16 cpu.cfs_period_us
-rw-r--r-- 1 root root 0 Jul 26 22:16 cpu.cfs_quota_us
-rw-r--r-- 1 root root 0 Jul 26 22:16 cpu.rt_period_us
-rw-r--r-- 1 root root 0 Jul 26 22:16 cpu.rt_runtime_us
-rw-r--r-- 1 root root 0 Jul 26 22:16 cpu.shares
-r--r--r-- 1 root root 0 Jul 26 22:16 cpu.stat
drwxr-xr-x 2 root root 0 Jul 26 22:16 cri-containerd-064f208642012e62cb45eda41d6113ee6107ed429a44e36f1eaf86407135ab3f.scope
drwxr-xr-x 2 root root 0 Jul 26 22:16 cri-containerd-4c851f314c196ce2258b96b9ada7cd1dd7dfa20475d35fcd644198f28d03389d.scope
-rw-r--r-- 1 root root 0 Jul 26 22:16 notify_on_release
-rw-r--r-- 1 root root 0 Jul 26 22:16 tasks

$ cat /sys/fs/cgroup/cpu/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod5cbfaa6d_d3c3_4df6_ab6b_3d05fd998b9f.slice/cpu.cfs_quota_us
50000

This is the 500m we set in the Pod's resources.limits.cpu: with the default cpu.cfs_period_us of 100000 (100ms), 500m (half a CPU) becomes a quota of 100000 × 0.5 = 50000.
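
A minimal sketch of that conversion, modeled on the kubelet's MilliCPUToQuota helper in pkg/kubelet/cm/helpers_linux.go (simplified here; treat the exact signature as an assumption):

package main

import "fmt"

const (
    quotaPeriod    = 100000 // default cpu.cfs_period_us: 100ms
    minQuotaPeriod = 1000   // smallest quota the kernel accepts
)

// milliCPUToQuota converts a CPU limit in millicores into a CFS quota.
func milliCPUToQuota(milliCPU int64) (quota int64) {
    if milliCPU == 0 {
        return // no limit set: quota stays 0 (unlimited)
    }
    quota = milliCPU * quotaPeriod / 1000
    if quota < minQuotaPeriod {
        quota = minQuotaPeriod
    }
    return
}

func main() {
    fmt.Println(milliCPUToQuota(500)) // 50000, the cpu.cfs_quota_us we just read
}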

Memory

Assemble the path the same way and check the memory limit:

# /sys/fs/cgroup/memory/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod${uid}.slice ("-" in the UID replaced with "_")
$ ll /sys/fs/cgroup/memory/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod5cbfaa6d_d3c3_4df6_ab6b_3d05fd998b9f.slice
total 0
-rw-r--r-- 1 root root 0 Jul 26 22:16 cgroup.clone_children
--w--w--w- 1 root root 0 Jul 26 22:16 cgroup.event_control
-rw-r--r-- 1 root root 0 Jul 26 22:16 cgroup.procs
drwxr-xr-x 2 root root 0 Jul 26 22:16 cri-containerd-064f208642012e62cb45eda41d6113ee6107ed429a44e36f1eaf86407135ab3f.scope
drwxr-xr-x 2 root root 0 Jul 26 22:16 cri-containerd-4c851f314c196ce2258b96b9ada7cd1dd7dfa20475d35fcd644198f28d03389d.scope
-rw-r--r-- 1 root root 0 Jul 26 22:16 memory.failcnt
--w------- 1 root root 0 Jul 26 22:16 memory.force_empty
-rw-r--r-- 1 root root 0 Jul 26 22:16 memory.kmem.failcnt
-rw-r--r-- 1 root root 0 Jul 26 22:16 memory.kmem.limit_in_bytes
-rw-r--r-- 1 root root 0 Jul 26 22:16 memory.kmem.max_usage_in_bytes
-r--r--r-- 1 root root 0 Jul 26 22:16 memory.kmem.slabinfo
-rw-r--r-- 1 root root 0 Jul 26 22:16 memory.kmem.tcp.failcnt
-rw-r--r-- 1 root root 0 Jul 26 22:16 memory.kmem.tcp.limit_in_bytes
-rw-r--r-- 1 root root 0 Jul 26 22:16 memory.kmem.tcp.max_usage_in_bytes
-r--r--r-- 1 root root 0 Jul 26 22:16 memory.kmem.tcp.usage_in_bytes
-r--r--r-- 1 root root 0 Jul 26 22:16 memory.kmem.usage_in_bytes
-rw-r--r-- 1 root root 0 Jul 26 22:16 memory.limit_in_bytes
-rw-r--r-- 1 root root 0 Jul 26 22:16 memory.max_usage_in_bytes
-rw-r--r-- 1 root root 0 Jul 26 22:16 memory.memsw.failcnt
-rw-r--r-- 1 root root 0 Jul 26 22:16 memory.memsw.limit_in_bytes
-rw-r--r-- 1 root root 0 Jul 26 22:16 memory.memsw.max_usage_in_bytes
-r--r--r-- 1 root root 0 Jul 26 22:16 memory.memsw.usage_in_bytes
-rw-r--r-- 1 root root 0 Jul 26 22:16 memory.move_charge_at_immigrate
-r--r--r-- 1 root root 0 Jul 26 22:16 memory.numa_stat
-rw-r--r-- 1 root root 0 Jul 26 22:16 memory.oom_control
---------- 1 root root 0 Jul 26 22:16 memory.pressure_level
-rw-r--r-- 1 root root 0 Jul 26 22:16 memory.soft_limit_in_bytes
-r--r--r-- 1 root root 0 Jul 26 22:16 memory.stat
-rw-r--r-- 1 root root 0 Jul 26 22:16 memory.swappiness
-r--r--r-- 1 root root 0 Jul 26 22:16 memory.usage_in_bytes
-rw-r--r-- 1 root root 0 Jul 26 22:16 memory.use_hierarchy
-rw-r--r-- 1 root root 0 Jul 26 22:16 notify_on_release
-rw-r--r-- 1 root root 0 Jul 26 22:16 tasks

$ cat /sys/fs/cgroup/memory/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod5cbfaa6d_d3c3_4df6_ab6b_3d05fd998b9f.slice/memory.limit_in_bytes
134217728 # 128MiB

This is the 128Mi we set in the Pod's resources.limits.memory: 128 × 1024 × 1024 = 134217728 bytes.

cgroupfs

Create the same nginx Deployment and find the Pod's UID:

$ kubectl get po nginx-deployment-6f96cddcf9-5c9ss -o jsonpath='{.metadata.uid}'
2409d111-0100-4da7-94c8-b63c6b659d2e

CPU

Assemble the path according to the naming rule and check the CPU quota:

# /sys/fs/cgroup/cpu/kubepods/burstable/pod${uid}/cpu.cfs_quota_us
$ cat /sys/fs/cgroup/cpu/kubepods/burstable/pod2409d111-0100-4da7-94c8-b63c6b659d2e/cpu.cfs_quota_us
50000

Again, the 500m we set in the Pod's resources.limits.cpu.

Memory

Assemble the path and check the memory limit:

# /sys/fs/cgroup/memory/kubepods/burstable/pod${uid}/memory.limit_in_bytes
$ cat /sys/fs/cgroup/memory/kubepods/burstable/pod2409d111-0100-4da7-94c8-b63c6b659d2e/memory.limit_in_bytes
134217728 # 128Mi

Again, the 128Mi we set in the Pod's resources.limits.memory.

So the visible difference between the two cgroup drivers lies in how they lay out the sub-paths under /sys/fs/cgroup.

kubelet

Let's look at how the kubelet handles the two cgroup drivers: https://github.com/kubernetes/kubernetes/blob/e31aafc4fdaa70e3e14b9402efef7bd8d153c0e5/pkg/kubelet/cm/container_manager_linux.go#L266

func NewContainerManager(mountUtil mount.Interface, cadvisorInterface cadvisor.Interface, nodeConfig NodeConfig, failSwapOn bool, devicePluginEnabled bool, recorder record.EventRecorder) (ContainerManager, error) {
    // a lot of code here
    cgroupManager := NewCgroupManager(subsystems, nodeConfig.CgroupDriver)
}

https://github.com/kubernetes/kubernetes/blob/cd6ffff85d257ff9067d59339f2ffdbcc66dc164/pkg/kubelet/cm/cgroup_manager_linux.go#L195-L205

const (
    // libcontainerCgroupfs means use libcontainer with cgroupfs
    libcontainerCgroupfs libcontainerCgroupManagerType = "cgroupfs"
    // libcontainerSystemd means use libcontainer with systemd
    libcontainerSystemd libcontainerCgroupManagerType = "systemd"
    // systemdSuffix is the cgroup name suffix for systemd
    systemdSuffix string = ".slice"
)

func NewCgroupManager(cs *CgroupSubsystems, cgroupDriver string) CgroupManager {
    managerType := libcontainerCgroupfs
    if cgroupDriver == string(libcontainerSystemd) { // systemd
        managerType = libcontainerSystemd
    }
    return &cgroupManagerImpl{
        subsystems: cs,
        adapter:    newLibcontainerAdapter(managerType),
    }
}

Currently, cgroupfs and systemd are the only two cgroup drivers the kubelet supports.

The Linux cgroup-related implementation lives in https://github.com/kubernetes/kubernetes/blob/cd6ffff85d257ff9067d59339f2ffdbcc66dc164/pkg/kubelet/cm/container_manager_linux.go.

The Name(name CgroupName) method assembles the driver-specific cgroup sub-path: https://github.com/kubernetes/kubernetes/blob/cd6ffff85d257ff9067d59339f2ffdbcc66dc164/pkg/kubelet/cm/cgroup_manager_linux.go#L207-L214

func (m *cgroupManagerImpl) Name(name CgroupName) string {
    if m.adapter.cgroupManagerType == libcontainerSystemd {
        return name.ToSystemd()
    }
    return name.ToCgroupfs()
}

Suppose the CgroupName string slice is {"kubepods", "burstable", "pod1234-abcd-5678-efgh"}:

  • systemd: name.ToSystemd()

    func (cgroupName CgroupName) ToSystemd() string {
        if len(cgroupName) == 0 || (len(cgroupName) == 1 && cgroupName[0] == "") {
            return "/"
        }
        newparts := []string{}
        for _, part := range cgroupName {
            part = escapeSystemdCgroupName(part)
            newparts = append(newparts, part)
        }
    
        result, err := cgroupsystemd.ExpandSlice(strings.Join(newparts, "-") + systemdSuffix)
        if err != nil {
            // Should never happen...
            panic(fmt.Errorf("error converting cgroup name [%v] to systemd format: %v", cgroupName, err))
        }
        return result
    }
    

    systemd's sub-path construction is the more involved of the two: the CgroupName above is expanded into /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod1234_abcd_5678_efgh.slice. Note that the "-" characters in the Pod UID are replaced with "_", which is exactly the format we saw in the systemd example above (a runnable sketch of both conversions follows this list).

  • cgroupfs: name.ToCgroupfs()

    func (cgroupName CgroupName) ToCgroupfs() string {
        return "/" + path.Join(cgroupName...)
    }
    

    cgroupfs is much simpler: the CgroupName is joined into /kubepods/burstable/pod1234-abcd-5678-efgh, the format from the cgroupfs example above.
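
To make both conversions concrete, here is a small self-contained sketch. escapeSystemdCgroupName matches the kubelet helper of the same name; the expansion that runc's cgroupsystemd.ExpandSlice performs is re-implemented here in simplified form rather than imported:

package main

import (
    "fmt"
    "path"
    "strings"
)

// escapeSystemdCgroupName mirrors the kubelet helper: systemd treats "-" as
// a path separator in slice names, so "-" inside a name must become "_".
func escapeSystemdCgroupName(part string) string {
    return strings.ReplaceAll(part, "-", "_")
}

// toSystemd joins the parts into a slice name and expands it the way
// cgroupsystemd.ExpandSlice does: each "-" introduces a parent slice.
func toSystemd(cgroupName []string) string {
    escaped := make([]string, 0, len(cgroupName))
    for _, part := range cgroupName {
        escaped = append(escaped, escapeSystemdCgroupName(part))
    }
    var b strings.Builder
    prefix := ""
    for _, comp := range strings.Split(strings.Join(escaped, "-"), "-") {
        b.WriteString("/" + prefix + comp + ".slice")
        prefix += comp + "-"
    }
    return b.String()
}

// toCgroupfs matches the kubelet's ToCgroupfs: a plain path join.
func toCgroupfs(cgroupName []string) string {
    return "/" + path.Join(cgroupName...)
}

func main() {
    name := []string{"kubepods", "burstable", "pod1234-abcd-5678-efgh"}
    fmt.Println(toSystemd(name))  // /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod1234_abcd_5678_efgh.slice
    fmt.Println(toCgroupfs(name)) // /kubepods/burstable/pod1234-abcd-5678-efgh
}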

Now let's look at the Create(cgroupConfig *CgroupConfig) method, which creates the cgroups:

type CgroupConfig struct {
    // Fully qualified name prior to any driver specific conversions.
    Name CgroupName
    // ResourceParameters contains various cgroups settings to apply.
    ResourceParameters *ResourceConfig
}

func (m *cgroupManagerImpl) Create(cgroupConfig *CgroupConfig) error {
    // a lot of code here
    if m.adapter.cgroupManagerType == libcontainerSystemd {
        updateSystemdCgroupInfo(libcontainerCgroupConfig, cgroupConfig.Name)
    } else {
        libcontainerCgroupConfig.Path = cgroupConfig.Name.ToCgroupfs()
    }

    // get the manager with the specified cgroup configuration
    manager, err := m.adapter.newManager(libcontainerCgroupConfig, nil)
    if err != nil {
        return err
    }

    // Apply(-1) is a hack to create the cgroup directories for each resource
    // subsystem. The function [cgroups.Manager.apply()] applies cgroup
    // configuration to the process with the specified pid.
    // It creates cgroup files for each subsystems and writes the pid
    // in the tasks file. We use the function to create all the required
    // cgroup files but not attach any "real" pid to the cgroup.
    if err := manager.Apply(-1); err != nil {
        return err
    }

    // it may confuse why we call set after we do apply, but the issue is that runc
    // follows a similar pattern.  it's needed to ensure cpu quota is set properly.
    if err := m.Update(cgroupConfig); err != nil {
        utilruntime.HandleError(fmt.Errorf("cgroup update failed %v", err))
    }

    return nil
}

func (l *libcontainerAdapter) newManager(cgroups *libcontainerconfigs.Cgroup, paths map[string]string) (libcontainercgroups.Manager, error) {
    switch l.cgroupManagerType {
    case libcontainerCgroupfs:
        if libcontainercgroups.IsCgroup2UnifiedMode() {
            return cgroupfs2.NewManager(cgroups, paths["memory"], false)
        }
        return cgroupfs.NewManager(cgroups, paths, false), nil
    case libcontainerSystemd:
        // this means you asked systemd to manage cgroups, but systemd was not on the host, so all you can do is panic...
        if !cgroupsystemd.IsRunningSystemd() {
            panic("systemd cgroup manager not available")
        }
        if libcontainercgroups.IsCgroup2UnifiedMode() {
            return cgroupsystemd.NewUnifiedManager(cgroups, paths["memory"], false), nil
        }
        return cgroupsystemd.NewLegacyManager(cgroups, paths), nil
    }
    return nil, fmt.Errorf("invalid cgroup manager configuration")
}

The cgroupConfig parameter carries the CgroupName together with the values that limit each compute resource.
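
Note that newManager branches on libcontainercgroups.IsCgroup2UnifiedMode() to choose between the cgroup v1 and v2 managers. That check boils down to a statfs on the cgroup mount point; a simplified sketch of what libcontainer does (its real version caches the result with sync.Once):

package main

import (
    "fmt"

    "golang.org/x/sys/unix"
)

// isCgroup2UnifiedMode reports whether /sys/fs/cgroup is a cgroup2 mount
// by comparing the filesystem magic number against cgroup2's.
func isCgroup2UnifiedMode() (bool, error) {
    var st unix.Statfs_t
    if err := unix.Statfs("/sys/fs/cgroup", &st); err != nil {
        return false, err
    }
    return st.Type == unix.CGROUP2_SUPER_MAGIC, nil
}

func main() {
    unified, err := isCgroup2UnifiedMode()
    if err != nil {
        panic(err)
    }
    fmt.Println("cgroup v2 unified mode:", unified)
}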

  • systemd: the updateSystemdCgroupInfo function (a worked example of the resulting Parent/Name split follows this list)

    func updateSystemdCgroupInfo(cgroupConfig *libcontainerconfigs.Cgroup, cgroupName CgroupName) {
        dir, base := path.Split(cgroupName.ToSystemd())
        if dir == "/" {
            dir = "-.slice"
        } else {
            dir = path.Base(dir)
        }
        cgroupConfig.Parent = dir
        cgroupConfig.Name = base
    }
    
  • cgroupfs

    libcontainerCgroupConfig.Path = cgroupConfig.Name.ToCgroupfs()
    
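Applied to the pod slice from earlier, updateSystemdCgroupInfo splits the expanded systemd path into a Parent slice and a unit Name. A quick illustration (this is just the same path handling, not kubelet code):

package main

import (
    "fmt"
    "path"
)

func main() {
    full := "/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod1234_abcd_5678_efgh.slice"
    dir, base := path.Split(full)
    fmt.Println(path.Base(dir)) // Parent: kubepods-burstable.slice
    fmt.Println(base)           // Name:   kubepods-burstable-pod1234_abcd_5678_efgh.slice
}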

At the end, Create calls the Update method to apply the cgroup settings:

func (m *cgroupManagerImpl) Update(cgroupConfig *CgroupConfig) error {
    // a lot of code here
    manager, err := m.adapter.newManager(libcontainerCgroupConfig, paths)
    if err != nil {
        return fmt.Errorf("failed to create cgroup manager: %v", err)
    }
    return manager.Set(resources)
}

  • systemd implementation:

    Linux cgroups come in two versions (v1 and v2). You can tell which one a host is running by checking what is mounted at /sys/fs/cgroup:

    $ stat -fc %T /sys/fs/cgroup/
    # cgroup2fs means cgroup v2 (unified mode); tmpfs means cgroup v1

    (/proc/cgroups shows this too: on v1 every controller has a nonzero hierarchy ID, while on v2 they are all 0.)

    The hosts in this post are on cgroup v1, as the per-controller paths like /sys/fs/cgroup/cpu above show, so the systemd implementation we need is the legacy manager's Set:

    func (m *legacyManager) Set(r *configs.Resources) error {
        // If Paths are set, then we are just joining cgroups paths
        // and there is no need to set any values.
        if m.cgroups.Paths != nil {
            return nil
        }
        if r.Unified != nil {
            return cgroups.ErrV1NoUnified
        }
        properties, err := genV1ResourcesProperties(r, m.dbus)
        if err != nil {
            return err
        }
    
        unitName := getUnitName(m.cgroups)
        needsFreeze, needsThaw, err := m.freezeBeforeSet(unitName, r)
        if err != nil {
            return err
        }
    
        if needsFreeze {
            if err := m.doFreeze(configs.Frozen); err != nil {
                // If freezer cgroup isn't supported, we just warn about it.
                logrus.Infof("freeze container before SetUnitProperties failed: %v", err)
            }
        }
        setErr := setUnitProperties(m.dbus, unitName, properties...)
        if needsThaw {
            if err := m.doFreeze(configs.Thawed); err != nil {
                logrus.Infof("thaw container after SetUnitProperties failed: %v", err)
            }
        }
        if setErr != nil {
            return setErr
        }
    
        for _, sys := range legacySubsystems {
            // Get the subsystem path, but don't error out for not found cgroups.
            path, ok := m.paths[sys.Name()]
            if !ok {
                continue
            }
            if err := sys.Set(path, r); err != nil {
                return err
            }
        }
    
        return nil
    }
    

    It sets cgroups for the container processes by calling the systemd API (setUnitProperties goes over D-Bus), and then also writes the per-subsystem values through the legacy cgroup paths, as the final loop shows (see the side-by-side sketch after this list).

  • cgroupfs implementation:

    func (m *manager) Set(r *configs.Resources) error {
        if r == nil {
            return nil
        }
    
        // If Paths are set, then we are just joining cgroups paths
        // and there is no need to set any values.
        if m.cgroups != nil && m.cgroups.Paths != nil {
            return nil
        }
        if r.Unified != nil {
            return cgroups.ErrV1NoUnified
        }
    
        m.mu.Lock()
        defer m.mu.Unlock()
        for _, sys := range subsystems {
            path := m.paths[sys.Name()]
            if err := sys.Set(path, r); err != nil {
                // When m.rootless is true, errors from the device subsystem are ignored because it is really not expected to work.
                // However, errors from other subsystems are not ignored.
                // see @test "runc create (rootless + limits + no cgrouppath + no permission) fails with informative error"
                if m.rootless && sys.Name() == "devices" {
                    continue
                }
                if path == "" {
                    // We never created a path for this cgroup, so we cannot set
                    // limits for it (though we have already tried at this point).
                    return fmt.Errorf("cannot set %s limit: container could not join or create cgroup", sys.Name())
                }
                return err
            }
        }
    
        return nil
    }
    

    It talks to the cgroup filesystem directly to configure cgroups for the container processes.
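
To make the contrast concrete, here is a rough side-by-side sketch of the two write paths for the memory limit from earlier. The unit name and cgroup path are illustrative only, and the systemd half assumes the github.com/coreos/go-systemd/v22/dbus client that runc itself uses:

package main

import (
    "fmt"
    "os"

    systemddbus "github.com/coreos/go-systemd/v22/dbus"
    godbus "github.com/godbus/dbus/v5"
)

func main() {
    // systemd driver: ask systemd over D-Bus to apply the limit to the unit.
    if conn, err := systemddbus.New(); err == nil {
        defer conn.Close()
        _ = conn.SetUnitProperties(
            "kubepods-burstable-pod1234_abcd_5678_efgh.slice", // hypothetical pod slice
            true, // runtime only: do not persist the change to disk
            systemddbus.Property{Name: "MemoryLimit", Value: godbus.MakeVariant(uint64(134217728))},
        )
    }

    // cgroupfs driver: write the value into the cgroup filesystem directly.
    p := "/sys/fs/cgroup/memory/kubepods/burstable/pod1234-abcd-5678-efgh/memory.limit_in_bytes"
    if err := os.WriteFile(p, []byte("134217728"), 0o644); err != nil {
        fmt.Println("write failed:", err)
    }
}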

Summary

The two cgroup drivers correspond to different sub-path construction rules. With the cgroupfs driver, the kubelet and the container runtime configure cgroups by reading and writing the cgroup filesystem directly; with the systemd driver, they delegate that work to systemd.

However, when systemd is the init system, the cgroupfs driver is not recommended: systemd is itself a cgroup manager and does not expect a second cgroup manager on the system. The two managers end up with different views of the resources in use, which in some cases makes nodes unstable under resource pressure.
