Hot-Plugging MacVTap NICs into Containers
Nov 30, 2023 20:30 · 1367 words · 3 minute read
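This came out of a business requirement: a controller daemon deployed as a DaemonSet hot-plugs and unplugs MacVTap NICs in a target container, for use by the qemu process running inside it. The CNI is kube-ovn.
We have already implemented the part where the controller daemon invokes the CNI ADD/DEL commands to create or delete a MacVTap NIC in the target container: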
CNI_COMMAND=ADD CNI_CONTAINERID=7e708ea26d1bbca24b11562f0cdca8605880f0f4c2945bbda8d728f41c0fc87a CNI_NETNS=/proc/519879/ns/net CNI_PATH=/opt/cni/bin/kube-ovn CNI_IFNAME=podb22b465632d CNI_ARGS="K8S_POD_NAME=virt-launcher-ecs-test4-macvtap-qkdjl;K8S_POD_NAMESPACE=default" /opt/cni/bin/kube-ovn < /etc/cni/net.d/01-kube-ovn.conflist
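The same invocation could be made from a Go daemon roughly as follows. This is only a sketch of the environment-variable protocol shown in the command above, not the controller's actual code; `cniEnv` and `invokeCNI` are hypothetical helper names, and the values in `main` are copied from the example command.

```go
package main

import (
	"context"
	"fmt"
	"os"
	"os/exec"
)

// cniEnv builds the environment variables used in the command above.
// All parameter names are illustrative.
func cniEnv(command, containerID, netnsPath, ifName, cniPath, podName, podNamespace string) []string {
	return []string{
		"CNI_COMMAND=" + command,
		"CNI_CONTAINERID=" + containerID,
		"CNI_NETNS=" + netnsPath,
		"CNI_IFNAME=" + ifName,
		"CNI_PATH=" + cniPath,
		fmt.Sprintf("CNI_ARGS=K8S_POD_NAME=%s;K8S_POD_NAMESPACE=%s", podName, podNamespace),
	}
}

// invokeCNI runs the plugin binary with the network configuration on stdin
// and returns whatever the plugin wrote to stdout.
func invokeCNI(ctx context.Context, pluginBin, confPath string, env []string) ([]byte, error) {
	conf, err := os.Open(confPath)
	if err != nil {
		return nil, fmt.Errorf("open network config: %w", err)
	}
	defer conf.Close()

	cmd := exec.CommandContext(ctx, pluginBin)
	cmd.Env = env
	cmd.Stdin = conf
	cmd.Stderr = os.Stderr
	return cmd.Output()
}

func main() {
	env := cniEnv(
		"ADD",
		"7e708ea26d1bbca24b11562f0cdca8605880f0f4c2945bbda8d728f41c0fc87a",
		"/proc/519879/ns/net",
		"podb22b465632d",
		"/opt/cni/bin/kube-ovn",
		"virt-launcher-ecs-test4-macvtap-qkdjl",
		"default",
	)
	out, err := invokeCNI(context.Background(), "/opt/cni/bin/kube-ovn", "/etc/cni/net.d/01-kube-ovn.conflist", env)
	if err != nil {
		fmt.Fprintln(os.Stderr, "CNI invocation failed:", err)
		return
	}
	fmt.Println(string(out))
}
```

Switching the first argument to "DEL" gives the matching teardown call.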
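A NIC named podb22b465632d does appear in the container, but the character device file /dev/tap${ifindex} that the Linux kernel generates for a MacVTap NIC exists only on the host, not in the container, and the process in the container needs that character device file: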
$ kubectl exec -it virt-launcher-ecs-test4-macvtap-qkdjl -- ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
780: eth0@if781: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
link/ether 00:00:00:68:50:7c brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet 172.10.255.141/16 brd 172.10.255.255 scope global eth0
valid_lft forever preferred_lft forever
inet6 fd00:10:16::52/64 scope global
valid_lft forever preferred_lft forever
inet6 fe80::200:ff:fe68:507c/64 scope link
valid_lft forever preferred_lft forever
783: pod17274e5ba35@if782: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1500
link/ether 00:00:00:26:b5:ca brd ff:ff:ff:ff:ff:ff
inet6 fe80::200:ff:fe26:b5ca/64 scope link
valid_lft forever preferred_lft forever
849: podb22b465632d@if848: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1500
link/ether 00:00:00:24:ca:35 brd ff:ff:ff:ff:ff:ff
inet6 fe80::200:ff:fe24:ca35/64 scope link
valid_lft forever preferred_lft forever
$ ll /dev/tap849
crw------- 1 root root 235, 2 Nov 29 18:30 /dev/tap849
$ kubectl exec -it virt-launcher-ecs-test4-macvtap-qkdjl -- ls -al /dev/tap849
ls: cannot access '/dev/tap849': No such file or directory
command terminated with exit code 2
This post describes a way to "insert" a device file from the host's /dev into a target container without restarting the container.
Our controller daemon runs in a privileged container:
$ kubectl get ds virt-handler -n kubevirt -o jsonpath='{.spec.template.spec.containers[0].securityContext}' | jq
{
"privileged": true,
"seLinuxOptions": {
"level": "s0"
}
}
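For a privileged container, the container runtime (containerd) and runc map in every device file that exists on the host at the moment the container is created: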
$ cat /run/containerd/io.containerd.runtime.v2.task/k8s.io/96d3c38abeb688983a2612417c959a3cf7dedf530ce567985bdc95c94d21808e/config.json | jq -r '.linux.devices' | head -n 50
[
{
"path": "/dev/autofs",
"type": "c",
"major": 10,
"minor": 235,
"fileMode": 420,
"uid": 0,
"gid": 0
},
{
"path": "/dev/bsg/0:0:0:0",
"type": "c",
"major": 247,
"minor": 0,
"fileMode": 384,
"uid": 0,
"gid": 0
},
{
"path": "/dev/bsg/1:0:0:0",
"type": "c",
"major": 247,
"minor": 1,
"fileMode": 384,
"uid": 0,
"gid": 0
},
{
"path": "/dev/bsg/2:0:0:0",
"type": "c",
"major": 247,
"minor": 2,
"fileMode": 384,
"uid": 0,
"gid": 0
},
{
"path": "/dev/bsg/3:0:0:0",
"type": "c",
"major": 247,
"minor": 3,
"fileMode": 384,
"uid": 0,
"gid": 0
},
{
"path": "/dev/bus/usb/001/001",
"type": "c",
"major": 189,
The character device file /dev/tap849 was created only after the DaemonSet Pod had started, so it is not part of that snapshot.
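To make the "snapshot" idea concrete, here is a small sketch that walks a directory tree and collects the device nodes present at that instant. It is not containerd's or runc's actual code; `listDeviceNodes` is a hypothetical name. Any node created after the walk finishes, such as /dev/tap849 here, is simply not in the result.

```go
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// listDeviceNodes returns the paths of all block and character device
// nodes found under root at the time of the call.
func listDeviceNodes(root string) ([]string, error) {
	var devices []string
	err := filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return nil // skip entries that cannot be read
		}
		// fs.ModeDevice is set for both block and character devices.
		if d.Type()&fs.ModeDevice != 0 {
			devices = append(devices, path)
		}
		return nil
	})
	return devices, err
}

func main() {
	devices, err := listDeviceNodes("/dev")
	if err != nil {
		fmt.Fprintln(os.Stderr, "walk failed:", err)
		return
	}
	fmt.Printf("%d device nodes visible under /dev right now\n", len(devices))
}
```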
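Although the controller daemon's container cannot see /dev/tap849 directly under its own /dev, because the file was created after the container started, it can see all of the host's device files in real time through /proc/1/root/dev (this relies on the Pod sharing the host's PID namespace, so that PID 1 is the host's init process):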
$ kubectl exec -it virt-handler-zg89b -n kubevirt -- ls -al /dev/tap849
Defaulted container "virt-handler" out of: virt-handler, virt-launcher (init)
ls: cannot access '/dev/tap849': No such file or directory
command terminated with exit code 2
$ kubectl exec -it virt-handler-zg89b -n kubevirt -- ls -al /proc/1/root/dev/tap849
Defaulted container "virt-handler" out of: virt-handler, virt-launcher (init)
crw------- 1 root root 235, 2 Nov 29 10:30 /proc/1/root/dev/tap849
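The controller daemon can therefore imitate the container runtime and map the host's /dev/tap849 into the target container's /dev, which gives us the "hot-plug":
- Read the major and minor numbers of the source character device file under /proc/1/root/dev: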
// OpenAtNoFollow safely opens a filedescriptor to a path relative to
// rootBase. Any symlink encountered will be treated as invalid and the operation will be aborted.
// This works best together with a path first resolved with JoinAndResolveWithRelativeRoot
// which can resolve relative paths and symlinks.
func OpenAtNoFollow(path *Path) (file *File, err error) {
	fd, err := open(path.rootBase)
	if err != nil {
		return nil, fmt.Errorf("failed opening path %v: %w", path, err)
	}
	for _, child := range strings.Split(filepath.Clean(path.relativePath), pathSeparator) {
		if child == "" {
			continue
		}
		newfd, err := openat(fd, child)
		_ = syscall.Close(fd) // always close the parent after the lookup
		if err != nil {
			return nil, fmt.Errorf("failed opening %s for path %v: %w", child, path, err)
		}
		fd = newfd
	}
	return &File{fd: fd, path: path}, nil
}

func StatAtNoFollow(path *Path) (os.FileInfo, error) {
	pathFd, err := OpenAtNoFollow(path)
	if err != nil {
		return nil, err
	}
	defer pathFd.Close()
	return os.Stat(pathFd.SafePath())
}

func getSourceMajorMinor(devicePath *safepath.Path) (uint64, os.FileMode, error) {
	fi, err := safepath.StatAtNoFollow(devicePath)
	if err != nil {
		return 0, 0, err
	}
	info := fi.Sys().(*syscall.Stat_t)
	return info.Rdev, fi.Mode(), nil
}
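- Create a file with the same name and type directly under /proc/${pid}/root/dev/ via mknod, where ${pid} is the PID of the target container's PID 1 process (the controller daemon is able to obtain it):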
func MknodAtNoFollow(path *Path, fileName string, mode os.FileMode, dev uint64) (err error) {
	if err := isSingleElement(fileName); err != nil {
		return err
	}
	parent, err := OpenAtNoFollow(path)
	if err != nil {
		return err
	}
	defer parent.Close()
	return mknodat(parent.fd, fileName, uint32(mode), dev)
}

func mknodat(dirfd int, path string, mode uint32, dev uint64) (err error) {
	if err := isSingleElement(path); err != nil {
		return err
	}
	return unix.Mknodat(dirfd, path, mode, int(dev))
}
As with the mknod command, the source device file's major and minor numbers and its type are passed as arguments.
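One detail worth spelling out: Go's os.FileMode type bits (os.ModeDevice, os.ModeCharDevice) are not the kernel's S_IFCHR bits, so the mode handed to mknod has to be assembled from the permission bits plus S_IFCHR. The sketch below uses only the standard library and a plain path, so it does not have the symlink protection of the safepath helpers above; `mknodMode` and `plugCharDevice` are hypothetical names, and actually creating a node requires sufficient privileges (root or CAP_MKNOD).

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"syscall"
)

// mknodMode converts Go permission bits into the mode argument that
// mknod(2) expects for a character device.
func mknodMode(perm os.FileMode) uint32 {
	return uint32(perm.Perm()) | syscall.S_IFCHR
}

// plugCharDevice creates /proc/<targetPID>/root/dev/<name> as a character
// device with the given raw device number (the source file's Rdev).
func plugCharDevice(targetPID int, name string, perm os.FileMode, dev uint64) error {
	target := filepath.Join("/proc", strconv.Itoa(targetPID), "root", "dev", name)
	return syscall.Mknod(target, mknodMode(perm), int(dev))
}

func main() {
	// Only print the computed mode here; calling plugCharDevice needs privileges
	// and a real target PID.
	fmt.Printf("mode passed to mknod: %#o\n", mknodMode(0o600))
}
```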
The tap849 device file then appears under /dev in the target container (virt-launcher-ecs-test4-macvtap-qkdjl):
$ kubectl exec -it virt-launcher-ecs-test4-macvtap-qkdjl -- ls -al /dev/tap849
crw------- 1 qemu qemu 235, 2 Nov 29 10:30 /dev/tap849
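The `235, 2` in this listing is the major/minor pair packed into the single device number (Rdev) that getSourceMajorMinor returns. Below is a minimal sketch of that packing on Linux, written by hand to stay within the standard library; golang.org/x/sys/unix provides Major, Minor, and Mkdev for the same purpose. The lowercase function names are illustrative.

```go
package main

import "fmt"

// major extracts the major number from a Linux device number.
func major(dev uint64) uint32 {
	m := uint32((dev & 0x00000000000fff00) >> 8)
	m |= uint32((dev & 0xfffff00000000000) >> 32)
	return m
}

// minor extracts the minor number from a Linux device number.
func minor(dev uint64) uint32 {
	m := uint32(dev & 0x00000000000000ff)
	m |= uint32((dev & 0x00000ffffff00000) >> 12)
	return m
}

// mkdev packs a major/minor pair back into a device number.
func mkdev(maj, min uint32) uint64 {
	dev := (uint64(maj) & 0x00000fff) << 8
	dev |= (uint64(maj) & 0xfffff000) << 32
	dev |= uint64(min) & 0x000000ff
	dev |= (uint64(min) & 0xffffff00) << 12
	return dev
}

func main() {
	dev := mkdev(235, 2) // the values shown for /dev/tap849
	fmt.Printf("%d, %d\n", major(dev), minor(dev)) // prints "235, 2"
}
```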
Conversely, the controller daemon "hot-unplugs" the device by deleting the tap849 file under /proc/${pid}/root/dev/.
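A cautious version of that removal might first check that the path really is a character device, so that a wrong path never deletes a regular file or directory. This is a sketch under that assumption, not the controller's actual code; `unplugCharDevice` is a hypothetical name.

```go
package main

import (
	"fmt"
	"os"
)

// unplugCharDevice removes path only if it is a character device node.
func unplugCharDevice(path string) error {
	fi, err := os.Lstat(path) // Lstat: do not follow a symlink at the final component
	if err != nil {
		return err
	}
	if fi.Mode()&os.ModeCharDevice == 0 {
		return fmt.Errorf("%s is not a character device, refusing to remove it", path)
	}
	return os.Remove(path)
}

func main() {
	if len(os.Args) != 2 {
		fmt.Fprintln(os.Stderr, "usage: unplug /proc/<pid>/root/dev/tap<ifindex>")
		return
	}
	if err := unplugCharDevice(os.Args[1]); err != nil {
		fmt.Fprintln(os.Stderr, "unplug failed:", err)
	}
}
```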
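One more detail: the name of the character device file associated with a MacVTap NIC is the fixed prefix tap followed by ${ifindex} (tap849). A privileged container has two ways to obtain the ifindex:
- Enter the container's network namespace and read it via netlink: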
link, err := netlink.LinkByName("podb22b465632d")
if err != nil {
	// error processing
}
ifindex := link.Attrs().Index
// ...
- Get the PID of the target container's PID 1 process and read the file /proc/${pid}/root/sys/class/net/${iface_name}/ifindex:
$ cat /proc/286140/root/sys/class/net/podb22b465632d/ifindex
849
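The second approach needs nothing beyond the standard library. Here is a minimal sketch, assuming the sysfs layout shown above; `readIfindex`, `parseIfindex`, and `tapDevicePath` are hypothetical helper names.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// parseIfindex parses the contents of a sysfs ifindex file.
func parseIfindex(raw string) (int, error) {
	idx, err := strconv.Atoi(strings.TrimSpace(raw))
	if err != nil {
		return 0, fmt.Errorf("parse ifindex %q: %w", raw, err)
	}
	if idx <= 0 {
		return 0, fmt.Errorf("invalid ifindex %d", idx)
	}
	return idx, nil
}

// readIfindex reads the ifindex of iface as seen by the process with the given PID.
func readIfindex(pid int, iface string) (int, error) {
	p := filepath.Join("/proc", strconv.Itoa(pid), "root", "sys", "class", "net", iface, "ifindex")
	raw, err := os.ReadFile(p)
	if err != nil {
		return 0, err
	}
	return parseIfindex(string(raw))
}

// tapDevicePath returns the MacVTap character device path for an ifindex.
func tapDevicePath(ifindex int) string {
	return fmt.Sprintf("/dev/tap%d", ifindex)
}

func main() {
	// Parse the value from the example above instead of touching /proc.
	idx, err := parseIfindex("849\n")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	fmt.Println(tapDevicePath(idx)) // prints "/dev/tap849"
}
```

In the daemon, `readIfindex(286140, "podb22b465632d")` would replace the hard-coded string, using the PID and interface name from the example.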