Hot-Plugging MacVTap NICs into Containers

Nov 30, 2023 20:30 · Container Linux Kubernetes

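This came out of a business requirement: a controller daemon, deployed as a DaemonSet, hot-plugs MacVTap NICs into a target container so that the qemu process inside it can use them. The CNI is kube-ovn.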

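We have already implemented the part where the controller daemon invokes the CNI ADD/DEL commands to create or delete a MacVTap NIC in the target container: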

$ CNI_COMMAND=ADD \
  CNI_CONTAINERID=7e708ea26d1bbca24b11562f0cdca8605880f0f4c2945bbda8d728f41c0fc87a \
  CNI_NETNS=/proc/519879/ns/net \
  CNI_PATH=/opt/cni/bin/kube-ovn \
  CNI_IFNAME=podb22b465632d \
  CNI_ARGS="K8S_POD_NAME=virt-launcher-ecs-test4-macvtap-qkdjl;K8S_POD_NAMESPACE=default" \
  /opt/cni/bin/kube-ovn < /etc/cni/net.d/01-kube-ovn.conflist
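
For a daemon written in Go, the same invocation could be sketched roughly as follows. This is only an illustration of the mechanism (environment variables plus the network configuration on stdin), not the controller's actual code: the `cniRequest` type and the `cniEnv` and `invokeCNI` functions are hypothetical names, and the values in `main` are simply the ones from the command above.

```go
package main

import (
	"bytes"
	"context"
	"fmt"
	"os"
	"os/exec"
)

// cniRequest is a hypothetical container for the values that go into the
// CNI_* environment variables.
type cniRequest struct {
	Command      string // "ADD" or "DEL"
	ContainerID  string
	NetNSPath    string // e.g. /proc/<pid>/ns/net
	IfName       string
	PluginPath   string // value passed through as CNI_PATH
	PodName      string
	PodNamespace string
}

// cniEnv renders the request as KEY=VALUE pairs for the plugin process.
func cniEnv(r cniRequest) []string {
	return []string{
		"CNI_COMMAND=" + r.Command,
		"CNI_CONTAINERID=" + r.ContainerID,
		"CNI_NETNS=" + r.NetNSPath,
		"CNI_PATH=" + r.PluginPath,
		"CNI_IFNAME=" + r.IfName,
		fmt.Sprintf("CNI_ARGS=K8S_POD_NAME=%s;K8S_POD_NAMESPACE=%s", r.PodName, r.PodNamespace),
	}
}

// invokeCNI runs the plugin binary with the configuration file on stdin and
// returns whatever the plugin printed on stdout.
func invokeCNI(ctx context.Context, binary, confPath string, env []string) ([]byte, error) {
	conf, err := os.Open(confPath)
	if err != nil {
		return nil, fmt.Errorf("open CNI config: %w", err)
	}
	defer conf.Close()

	var stdout, stderr bytes.Buffer
	cmd := exec.CommandContext(ctx, binary)
	cmd.Env = append(os.Environ(), env...)
	cmd.Stdin = conf
	cmd.Stdout = &stdout
	cmd.Stderr = &stderr
	if err := cmd.Run(); err != nil {
		return nil, fmt.Errorf("CNI plugin failed: %w (stderr: %s)", err, stderr.String())
	}
	return stdout.Bytes(), nil
}

func main() {
	env := cniEnv(cniRequest{
		Command:      "ADD",
		ContainerID:  "7e708ea26d1bbca24b11562f0cdca8605880f0f4c2945bbda8d728f41c0fc87a",
		NetNSPath:    "/proc/519879/ns/net",
		IfName:       "podb22b465632d",
		PluginPath:   "/opt/cni/bin/kube-ovn",
		PodName:      "virt-launcher-ecs-test4-macvtap-qkdjl",
		PodNamespace: "default",
	})
	// Only print the environment here; invokeCNI needs a node where the
	// plugin binary and its configuration are actually installed.
	for _, kv := range env {
		fmt.Println(kv)
	}
}
```

On a real node the daemon would then call `invokeCNI` with the plugin binary and configuration paths shown in the shell command.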

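A NIC named podb22b465632d does appear in the container, but the character device file /dev/tap${ifindex} that the Linux kernel generates for a MacVTap NIC does not, and the process in the container needs that file. In the output below, /dev/tap849 exists on the host but is missing inside the container: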

$ kubectl exec -it virt-launcher-ecs-test4-macvtap-qkdjl -- ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
780: eth0@if781: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
    link/ether 00:00:00:68:50:7c brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 172.10.255.141/16 brd 172.10.255.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fd00:10:16::52/64 scope global
       valid_lft forever preferred_lft forever
    inet6 fe80::200:ff:fe68:507c/64 scope link
       valid_lft forever preferred_lft forever
783: pod17274e5ba35@if782: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1500
    link/ether 00:00:00:26:b5:ca brd ff:ff:ff:ff:ff:ff
    inet6 fe80::200:ff:fe26:b5ca/64 scope link
       valid_lft forever preferred_lft forever
849: podb22b465632d@if848: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1500
    link/ether 00:00:00:24:ca:35 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::200:ff:fe24:ca35/64 scope link
       valid_lft forever preferred_lft forever

$ ll /dev/tap849
crw------- 1 root root 235, 2 Nov 29 18:30 /dev/tap849

$ kubectl exec -it virt-launcher-ecs-test4-macvtap-qkdjl -- ls -al /dev/tap849
ls: cannot access '/dev/tap849': No such file or directory
command terminated with exit code 2

This post describes a way to "insert" device files from the host's /dev into a target container without restarting the container.

Our controller daemon runs in a privileged container:

$ kubectl get ds virt-handler -n kubevirt -o jsonpath='{.spec.template.spec.containers[0].securityContext}' | jq
{
  "privileged": true,
  "seLinuxOptions": {
    "level": "s0"
  }
}

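For a privileged container, the container runtime (containerd) and runc map in every device file that exists on the host at the moment the container is created: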

$ cat /run/containerd/io.containerd.runtime.v2.task/k8s.io/96d3c38abeb688983a2612417c959a3cf7dedf530ce567985bdc95c94d21808e/config.json | jq -r '.linux.devices' | head -n 50
[
  {
    "path": "/dev/autofs",
    "type": "c",
    "major": 10,
    "minor": 235,
    "fileMode": 420,
    "uid": 0,
    "gid": 0
  },
  {
    "path": "/dev/bsg/0:0:0:0",
    "type": "c",
    "major": 247,
    "minor": 0,
    "fileMode": 384,
    "uid": 0,
    "gid": 0
  },
  {
    "path": "/dev/bsg/1:0:0:0",
    "type": "c",
    "major": 247,
    "minor": 1,
    "fileMode": 384,
    "uid": 0,
    "gid": 0
  },
  {
    "path": "/dev/bsg/2:0:0:0",
    "type": "c",
    "major": 247,
    "minor": 2,
    "fileMode": 384,
    "uid": 0,
    "gid": 0
  },
  {
    "path": "/dev/bsg/3:0:0:0",
    "type": "c",
    "major": 247,
    "minor": 3,
    "fileMode": 384,
    "uid": 0,
    "gid": 0
  },
  {
    "path": "/dev/bus/usb/001/001",
    "type": "c",
    "major": 189,

The character device file /dev/tap849, however, was created only after the DaemonSet Pod had started, so it is not part of that snapshot.

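The controller daemon's container therefore cannot see /dev/tap849 under its own /dev, because the file was created after the container started. It can, however, see all of the host's device files in real time through /proc/1/root/dev (this relies on the Pod sharing the host PID namespace, so that PID 1 is the host's init process):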

$ kubectl exec -it virt-handler-zg89b -n kubevirt -- ls -al /dev/tap849
Defaulted container "virt-handler" out of: virt-handler, virt-launcher (init)
ls: cannot access '/dev/tap849': No such file or directory
command terminated with exit code 2

$ kubectl exec -it virt-handler-zg89b -n kubevirt -- ls -al /proc/1/root/dev/tap849
Defaulted container "virt-handler" out of: virt-handler, virt-launcher (init)
crw------- 1 root root 235, 2 Nov 29 10:30 /proc/1/root/dev/tap849

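The controller daemon can therefore imitate the container runtime and map the host's /dev/tap849 into the target container's /dev, which gives us the "hot-plug":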

  1. Read the major and minor numbers of the source character device file under /proc/1/root/dev:

    // OpenAtNoFollow safely opens a file descriptor to a path relative to
    // rootBase. Any symlink encountered is treated as invalid and aborts the
    // operation. It works best with a path first resolved with
    // JoinAndResolveWithRelativeRoot, which can resolve relative paths and symlinks.
    func OpenAtNoFollow(path *Path) (file *File, err error) {
        fd, err := open(path.rootBase)
        if err != nil {
            return nil, fmt.Errorf("failed opening path %v: %w", path, err)
        }
        for _, child := range strings.Split(filepath.Clean(path.relativePath), pathSeparator) {
            if child == "" {
                continue
            }
            newfd, err := openat(fd, child)
            _ = syscall.Close(fd) // always close the parent after the lookup
            if err != nil {
                return nil, fmt.Errorf("failed opening %s for path %v: %w", child, path, err)
            }
            fd = newfd
        }
        return &File{fd: fd, path: path}, nil
    }
    
    func StatAtNoFollow(path *Path) (os.FileInfo, error) {
        pathFd, err := OpenAtNoFollow(path)
        if err != nil {
            return nil, err
        }
        defer pathFd.Close()
        return os.Stat(pathFd.SafePath())
    }
    
    func getSourceMajorMinor(devicePath *safepath.Path) (uint64, os.FileMode, error) {
        fi, err := safepath.StatAtNoFollow(devicePath)
        if err != nil {
            return 0, 0, err
        }
        info := fi.Sys().(*syscall.Stat_t)
        return info.Rdev, fi.Mode(), nil
    }
    
  2. Create a file with the same name and type directly under /proc/${pid}/root/dev/ via mknod (the controller daemon is able to obtain the PID of the target container's process 1):

    func MknodAtNoFollow(path *Path, fileName string, mode os.FileMode, dev uint64) (err error) {
        if err := isSingleElement(fileName); err != nil {
            return err
        }
        parent, err := OpenAtNoFollow(path)
        if err != nil {
            return err
        }
        defer parent.Close()
        return mknodat(parent.fd, fileName, uint32(mode), dev)
    }
    
    func mknodat(dirfd int, path string, mode uint32, dev uint64) (err error) {
        if err := isSingleElement(path); err != nil {
            return err
        }
        return unix.Mknodat(dirfd, path, mode, int(dev))
    }
    

    As with the mknod command, the arguments are the source device file's major and minor numbers plus its type.
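
To make the numbers passed between the two steps more concrete, here is a small sketch of what `Rdev` and the mode contain. The `splitDev`, `makeDev` and `charDevMode` helpers are hypothetical names and not part of the code above. The bit layout is the Linux `dev_t` encoding; the mode conversion reflects an assumption about how the pieces fit together: `os.FileMode` has its own bit layout that differs from the `S_IF*` bits mknod(2) expects, so a caller presumably has to translate the value returned by `fi.Mode()` before handing it to mknodat.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// splitDev decodes a Linux dev_t (as found in syscall.Stat_t.Rdev) into its
// major and minor numbers.
func splitDev(dev uint64) (major, minor uint32) {
	major = uint32((dev & 0x00000000000fff00) >> 8)
	major |= uint32((dev & 0xfffff00000000000) >> 32)
	minor = uint32(dev & 0x00000000000000ff)
	minor |= uint32((dev & 0x00000ffffff00000) >> 12)
	return major, minor
}

// makeDev is the inverse of splitDev.
func makeDev(major, minor uint32) uint64 {
	dev := (uint64(major) & 0x00000fff) << 8
	dev |= (uint64(major) & 0xfffff000) << 32
	dev |= uint64(minor) & 0x000000ff
	dev |= (uint64(minor) & 0xffffff00) << 12
	return dev
}

// charDevMode converts a Go os.FileMode describing a character device into
// the raw mode for mknod(2): S_IFCHR plus the permission bits.
func charDevMode(fm os.FileMode) (uint32, error) {
	if fm&os.ModeCharDevice == 0 {
		return 0, fmt.Errorf("mode %v is not a character device", fm)
	}
	return syscall.S_IFCHR | uint32(fm.Perm()), nil
}

func main() {
	// 235 and 2 are the major/minor numbers shown for /dev/tap849 above.
	dev := makeDev(235, 2)
	major, minor := splitDev(dev)
	fmt.Printf("dev=%#x major=%d minor=%d\n", dev, major, minor)

	mode, err := charDevMode(os.ModeDevice | os.ModeCharDevice | 0o600)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("mknod mode=%#o\n", mode)
}
```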

The tap849 device file then appears under /dev in the target container (virt-launcher-ecs-test4-macvtap-qkdjl):

$ kubectl exec -it virt-launcher-ecs-test4-macvtap-qkdjl -- ls -al /dev/tap849
crw------- 1 qemu qemu 235, 2 Nov 29 10:30 /dev/tap849

Conversely, to "hot-unplug" the device, the controller daemon only has to delete the tap849 file under /proc/${pid}/root/dev/.
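
A minimal sketch of that removal is shown below, under some simplifying assumptions. `targetDevPath` and `unplugDevice` are hypothetical names, and the sketch uses plain path-based calls for brevity, so it does not have the symlink protections of the openat-style helpers shown earlier. It refuses to remove anything that is not a character device, so that a wrong name cannot delete an unrelated file.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// targetDevPath returns the location of a device node in the target
// container's /dev, as seen through /proc/<pid>/root.
func targetDevPath(pid int, name string) (string, error) {
	if pid <= 0 {
		return "", fmt.Errorf("invalid pid %d", pid)
	}
	if name == "" || name == "." || name == ".." || strings.ContainsRune(name, '/') {
		return "", fmt.Errorf("device name %q must be a single path element", name)
	}
	return filepath.Join("/proc", fmt.Sprint(pid), "root", "dev", name), nil
}

// unplugDevice deletes the device node if it exists and is a character device.
func unplugDevice(pid int, name string) error {
	p, err := targetDevPath(pid, name)
	if err != nil {
		return err
	}
	fi, err := os.Lstat(p)
	if errors.Is(err, os.ErrNotExist) {
		return nil // already gone, nothing to do
	}
	if err != nil {
		return err
	}
	if fi.Mode()&os.ModeCharDevice == 0 {
		return fmt.Errorf("%s is not a character device, refusing to remove it", p)
	}
	return os.Remove(p)
}

func main() {
	// Only show the path that would be removed; calling unplugDevice needs a
	// real target container and sufficient privileges.
	p, err := targetDevPath(519879, "tap849")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("would remove:", p)
}
```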

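In addition, the name of the character device file associated with a MacVTap NIC is the fixed prefix tap followed by ${ifindex} (tap849). A privileged container has two ways to obtain the ifindex: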

  1. Enter the container's network namespace and read it via netlink:

    link, err := netlink.LinkByName("podb22b465632d")
    if err != nil {
        // handle the error
    }
    ifindex := link.Attrs().Index
    // ...
    
  2. Get the PID of the target container's process 1 and read the /sys/class/net/${iface_name}/ifindex file:

    $ cat /proc/286140/root/sys/class/net/podb22b465632d/ifindex
    849
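
The second approach needs only the standard library. The sketch below assumes the sysfs layout shown above; `readIfindex`, `parseIfindex` and `tapDeviceName` are hypothetical helper names used for illustration.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// parseIfindex parses the content of an ifindex file such as "849\n".
func parseIfindex(raw string) (int, error) {
	idx, err := strconv.Atoi(strings.TrimSpace(raw))
	if err != nil {
		return 0, fmt.Errorf("parse ifindex %q: %w", raw, err)
	}
	if idx <= 0 {
		return 0, fmt.Errorf("ifindex %d out of range", idx)
	}
	return idx, nil
}

// readIfindex reads the interface index through the target container's
// sysfs, reached via /proc/<pid>/root.
func readIfindex(pid int, iface string) (int, error) {
	if iface == "" || iface == "." || iface == ".." || strings.ContainsRune(iface, '/') {
		return 0, fmt.Errorf("invalid interface name %q", iface)
	}
	p := filepath.Join("/proc", strconv.Itoa(pid), "root", "sys", "class", "net", iface, "ifindex")
	raw, err := os.ReadFile(p)
	if err != nil {
		return 0, err
	}
	return parseIfindex(string(raw))
}

// tapDeviceName builds the character device name for a MacVTap ifindex.
func tapDeviceName(ifindex int) string {
	return "tap" + strconv.Itoa(ifindex)
}

func main() {
	// Parse the sample content from the output above instead of touching
	// /proc; readIfindex does the same against a live container.
	idx, err := parseIfindex("849\n")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(filepath.Join("/dev", tapDeviceName(idx)))
}
```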