Hot-Plugging DPDK NICs into Containers
Dec 1, 2023 16:30 · 735 words · 2 minute read
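This came from a business requirement. A controller daemon, deployed as a DaemonSet, hot-plugs vhost-user unix domain sockets into target containers, where the qemu process inside the container uses them as DPDK NICs. The CNI is kube-ovn.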
A target Pod that will use a DPDK NIC must be created with an emptyDir volume named vhostuser-sockets:
apiVersion: v1
kind: Pod
metadata:
  # ...
spec:
  containers:
  - name: compute
    # ...
    volumeMounts:
    - mountPath: /var/run/openvswitch/vhostuser-sockets
      mountPropagation: Bidirectional
      name: vhostuser-sockets
  volumes:
  - emptyDir: {}
    name: vhostuser-sockets
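A controller could verify this requirement before attempting a hot-plug. The sketch below is an illustration, not kube-ovn code: it uses hypothetical, trimmed-down stand-ins for the corev1 Volume/VolumeMount types (a real controller would use k8s.io/api/core/v1), and it also insists on the Bidirectional propagation from the manifest above, which is an assumption layered on top of the stated "emptyDir named vhostuser-sockets" rule.

```go
package main

import "fmt"

// Hypothetical, trimmed-down stand-ins for the corev1 types; a real
// controller would use k8s.io/api/core/v1 instead.
type EmptyDirVolumeSource struct{}

type Volume struct {
	Name     string
	EmptyDir *EmptyDirVolumeSource
}

type VolumeMount struct {
	Name             string
	MountPath        string
	MountPropagation string
}

type Container struct {
	Name         string
	VolumeMounts []VolumeMount
}

type PodSpec struct {
	Containers []Container
	Volumes    []Volume
}

const (
	sockVolume    = "vhostuser-sockets"
	sockMountPath = "/var/run/openvswitch/vhostuser-sockets"
)

// checkVhostuserVolume returns nil if spec has an emptyDir volume named
// vhostuser-sockets and the given container mounts it at sockMountPath
// with Bidirectional propagation, as in the example manifest.
func checkVhostuserVolume(spec PodSpec, container string) error {
	found := false
	for _, v := range spec.Volumes {
		if v.Name != sockVolume {
			continue
		}
		if v.EmptyDir == nil {
			return fmt.Errorf("volume %s is not an emptyDir", sockVolume)
		}
		found = true
		break
	}
	if !found {
		return fmt.Errorf("volume %s not found", sockVolume)
	}
	for _, c := range spec.Containers {
		if c.Name != container {
			continue
		}
		for _, m := range c.VolumeMounts {
			if m.Name == sockVolume && m.MountPath == sockMountPath && m.MountPropagation == "Bidirectional" {
				return nil
			}
		}
		return fmt.Errorf("container %s does not mount %s at %s (Bidirectional)", container, sockVolume, sockMountPath)
	}
	return fmt.Errorf("container %s not found", container)
}

func main() {
	spec := PodSpec{
		Containers: []Container{{
			Name:         "compute",
			VolumeMounts: []VolumeMount{{Name: sockVolume, MountPath: sockMountPath, MountPropagation: "Bidirectional"}},
		}},
		Volumes: []Volume{{Name: sockVolume, EmptyDir: &EmptyDirVolumeSource{}}},
	}
	fmt.Println(checkVhostuserVolume(spec, "compute")) // prints <nil>
}
```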
We have already implemented the controller daemon's calls to the kube-ovn CNI ADD/DEL commands, which create/delete the DPDK NIC in the container:
CNI_COMMAND=ADD \
CNI_CONTAINERID=5fabffc727432dac08fc03d974dbf9e2aa14e8963a65e30e67e5ebc587b2370a \
CNI_NETNS=/proc/637226/ns/net \
CNI_PATH=/opt/cni/bin/kube-ovn \
CNI_IFNAME=pod59cc33fa39b \
CNI_ARGS="K8S_POD_NAME=virt-launcher-ecs-test7-qt99w;K8S_POD_NAMESPACE=default" \
/opt/cni/bin/kube-ovn < /etc/cni/net.d/01-kube-ovn.conflist
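From Go, the daemon could issue the same call roughly as follows. This is a sketch assuming the daemon execs the plugin binary directly, as the shell command does; the cniRequest type and its field names are hypothetical. The CNI_* variable names come from the CNI spec, which defines CNI_PATH as a list of directories to search for plugins, so the usage example passes /opt/cni/bin. The config bytes are forwarded on stdin unchanged, just like the shell redirect.

```go
package main

import (
	"bytes"
	"fmt"
	"os"
	"os/exec"
)

// cniRequest holds the per-call values (hypothetical type; the env var
// names it produces follow the CNI spec).
type cniRequest struct {
	Command     string // "ADD" or "DEL"
	ContainerID string
	NetNS       string // e.g. /proc/<pid>/ns/net
	IfName      string // also used as the vhost-user socket name in this post
	PodName     string
	PodNS       string
	Path        string // CNI_PATH: directories to search for plugins
}

// env builds the CNI_* environment for one plugin invocation.
func (r cniRequest) env() []string {
	return []string{
		"CNI_COMMAND=" + r.Command,
		"CNI_CONTAINERID=" + r.ContainerID,
		"CNI_NETNS=" + r.NetNS,
		"CNI_IFNAME=" + r.IfName,
		"CNI_PATH=" + r.Path,
		fmt.Sprintf("CNI_ARGS=K8S_POD_NAME=%s;K8S_POD_NAMESPACE=%s", r.PodName, r.PodNS),
	}
}

// runCNI execs the plugin with the request env and conf on stdin and
// returns its stdout (the CNI result for ADD).
func runCNI(plugin string, conf []byte, r cniRequest) ([]byte, error) {
	cmd := exec.Command(plugin)
	// Later entries win for duplicate keys, so these override inherited CNI_* vars.
	cmd.Env = append(os.Environ(), r.env()...)
	cmd.Stdin = bytes.NewReader(conf)
	var stdout, stderr bytes.Buffer
	cmd.Stdout = &stdout
	cmd.Stderr = &stderr
	if err := cmd.Run(); err != nil {
		return nil, fmt.Errorf("cni %s failed: %v: %s", r.Command, err, stderr.String())
	}
	return stdout.Bytes(), nil
}

func main() {
	conf, err := os.ReadFile("/etc/cni/net.d/01-kube-ovn.conflist")
	if err != nil {
		fmt.Println("read config:", err)
		return
	}
	out, err := runCNI("/opt/cni/bin/kube-ovn", conf, cniRequest{
		Command:     "ADD",
		ContainerID: "5fabffc727432dac08fc03d974dbf9e2aa14e8963a65e30e67e5ebc587b2370a",
		NetNS:       "/proc/637226/ns/net",
		IfName:      "pod59cc33fa39b",
		PodName:     "virt-launcher-ecs-test7-qt99w",
		PodNS:       "default",
		Path:        "/opt/cni/bin",
	})
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("CNI result: %s\n", out)
}
```

Hot-unplug would reuse runCNI with Command set to "DEL".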
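qemu can attach a vhost-user (DPDK) NIC in one of two modes, client or server. We currently use server mode, in which qemu creates and listens on the unix socket and OVS connects to it as the client: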
<interface type='vhostuser'>
  <mac address='00:00:00:24:65:f4'/>
  <source type='unix' path='/var/run/openvswitch/vhostuser-sockets/pod59cc33fa39b' mode='server'/>
  <model type='virtio-non-transitional'/>
  <driver name='vhost' queues='4'/>
  <alias name='ua-np-test-dpdk'/>
  <address type='pci' domain='0x0000' bus='0x02' slot='0x00' function='0x0'/>
</interface>
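If the controller generates this XML itself, encoding/xml is enough. A sketch under assumptions: the struct and function names are hypothetical, the socket is named after the CNI interface name (as pod59cc33fa39b is above), and the <address> element is left out on the assumption that libvirt may assign the PCI slot itself. encoding/xml writes explicit closing tags and double-quoted attributes instead of the self-closing form above, which is equivalent XML.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// Hypothetical structs covering only the elements used above.
type macElem struct {
	Address string `xml:"address,attr"`
}

type sourceElem struct {
	Type string `xml:"type,attr"`
	Path string `xml:"path,attr"`
	Mode string `xml:"mode,attr"`
}

type modelElem struct {
	Type string `xml:"type,attr"`
}

type driverElem struct {
	Name   string `xml:"name,attr"`
	Queues int    `xml:"queues,attr"`
}

type aliasElem struct {
	Name string `xml:"name,attr"`
}

type vhostuserInterface struct {
	XMLName xml.Name   `xml:"interface"`
	Type    string     `xml:"type,attr"`
	MAC     macElem    `xml:"mac"`
	Source  sourceElem `xml:"source"`
	Model   modelElem  `xml:"model"`
	Driver  driverElem `xml:"driver"`
	Alias   aliasElem  `xml:"alias"`
}

// Socket directory as mounted inside the container.
const sockDir = "/var/run/openvswitch/vhostuser-sockets"

// newVhostuserInterface fills in a server-mode vhost-user interface.
func newVhostuserInterface(mac, ifName string, queues int, alias string) vhostuserInterface {
	return vhostuserInterface{
		Type:   "vhostuser",
		MAC:    macElem{Address: mac},
		Source: sourceElem{Type: "unix", Path: sockDir + "/" + ifName, Mode: "server"},
		Model:  modelElem{Type: "virtio-non-transitional"},
		Driver: driverElem{Name: "vhost", Queues: queues},
		Alias:  aliasElem{Name: alias},
	}
}

// renderInterfaceXML produces the device XML to hand to AttachDevice.
func renderInterfaceXML(v vhostuserInterface) (string, error) {
	out, err := xml.MarshalIndent(v, "", "  ")
	if err != nil {
		return "", err
	}
	return string(out), nil
}

// parseInterfaceXML reads such an element back, e.g. to recover the socket path.
func parseInterfaceXML(s string) (vhostuserInterface, error) {
	var v vhostuserInterface
	err := xml.Unmarshal([]byte(s), &v)
	return v, err
}

func main() {
	v := newVhostuserInterface("00:00:00:24:65:f4", "pod59cc33fa39b", 4, "ua-np-test-dpdk")
	s, err := renderInterfaceXML(v)
	if err != nil {
		fmt.Println("marshal:", err)
		return
	}
	fmt.Println(s)
}
```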
After libvirtd in the container calls AttachDevice to attach the NIC, a new unix domain socket appears under /var/run/openvswitch/vhostuser-sockets in the container:
kubectl exec -it virt-launcher-ecs-test7-qt99w -- ls -al /var/run/openvswitch/vhostuser-sockets
total 0
drwxrwsrwx 2 root qemu 50 Nov 29 09:21 .
drwxr-xr-x 3 root root 31 Nov 24 07:20 ..
srwxrwxr-x 1 qemu qemu 0 Nov 29 09:21 pod59cc33fa39b
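Since qemu creates this socket itself in server mode, anything that depends on it has to wait until it exists. A minimal polling sketch; waitForSocket and demo are hypothetical names, and demo plays qemu's part with a throwaway listener in a temp directory.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"os"
	"path/filepath"
	"time"
)

// waitForSocket polls path until it exists and is a unix domain socket,
// returning an error on timeout or if something else occupies the path.
func waitForSocket(path string, timeout, interval time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		fi, err := os.Stat(path)
		if err == nil {
			if fi.Mode()&os.ModeSocket != 0 {
				return nil
			}
			return fmt.Errorf("%s exists but is not a socket (mode %v)", path, fi.Mode())
		}
		if !errors.Is(err, os.ErrNotExist) {
			return err
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("timed out after %v waiting for %s", timeout, path)
		}
		time.Sleep(interval)
	}
}

// demo creates a listening unix socket in a temp directory 100ms after
// the wait starts (standing in for qemu), then reports the wait result.
func demo() error {
	dir, err := os.MkdirTemp("", "vhu")
	if err != nil {
		return err
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, "pod59cc33fa39b")

	var ln net.Listener
	listened := make(chan error, 1)
	go func() {
		time.Sleep(100 * time.Millisecond)
		l, err := net.Listen("unix", path)
		ln = l
		listened <- err
	}()

	waitErr := waitForSocket(path, 2*time.Second, 20*time.Millisecond)
	if err := <-listened; err != nil {
		return err
	}
	ln.Close() // also unlinks the socket file
	return waitErr
}

func main() {
	fmt.Println(demo())
}
```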
The usock also shows up on the host; its full host path is built from the Pod UID:
$ kubectl get po virt-launcher-ecs-test7-qt99w -o jsonpath='{.metadata.uid}'
0087f7d0-3024-483b-b3a1-6eb44b24f341
$ ls /var/run/openvswitch/vhost_sockets/0087f7d0-3024-483b-b3a1-6eb44b24f341/vhostuser-sockets/
pod17274e5ba35 pod59cc33fa39b
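Both host-side directories can be derived from the Pod UID. The sketch assumes the base directories seen in this post (/run/openvswitch/vhost_sockets for the short path; /var/run is usually a symlink to /run). emptyDirPath follows the originSharedDir format string in kube-ovn's createShortSharedDir quoted below, and shortDirPath is a hypothetical stand-in for its getShortSharedDir, whose body is not shown here. The length check compares against the 108-byte sun_path of a Linux unix socket address, which must also hold the terminating NUL; that limit is presumably why a short directory is used at all.

```go
package main

import (
	"fmt"
	"path/filepath"
)

const (
	kubeletPodsDir       = "/var/lib/kubelet/pods"
	hostVhostuserBaseDir = "/run/openvswitch/vhost_sockets" // assumed, from the listing above
	// Linux sockaddr_un.sun_path is 108 bytes and must also hold the NUL.
	sunPathLen = 108
)

// emptyDirPath is where kubelet keeps the Pod's emptyDir on the host.
func emptyDirPath(podUID, volume string) string {
	return filepath.Join(kubeletPodsDir, podUID, "volumes", "kubernetes.io~empty-dir", volume)
}

// shortDirPath is a hypothetical stand-in for kube-ovn's getShortSharedDir.
func shortDirPath(podUID, volume string) string {
	return filepath.Join(hostVhostuserBaseDir, podUID, volume)
}

// fitsSunPath reports whether p, plus its terminating NUL, fits in sun_path.
func fitsSunPath(p string) bool {
	return len(p)+1 <= sunPathLen
}

func main() {
	uid := "0087f7d0-3024-483b-b3a1-6eb44b24f341" // Pod UID from the kubectl output above
	for _, p := range []string{
		filepath.Join(emptyDirPath(uid, "vhostuser-sockets"), "pod59cc33fa39b"),
		filepath.Join(shortDirPath(uid, "vhostuser-sockets"), "pod59cc33fa39b"),
	} {
		fmt.Println(len(p), fitsSunPath(p), p)
	}
}
```

With the UID above, the socket path under the kubelet emptyDir is 123 bytes, too long for sun_path, while the short path is 100 bytes.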
The container running kube-ovn's cni-server already mounts the host paths /var/lib/kubelet/pods and /run/openvswitch as hostPath volumes:
$ kubectl get ds kube-ovn-cni -n kube-system -o jsonpath='{.spec.template.spec.volumes}' | jq
[
  {
    "hostPath": {
      "path": "/var/lib/kubelet/pods",
      "type": ""
    },
    "name": "shared-dir"
  },
  {
    "hostPath": {
      "path": "/run/openvswitch",
      "type": ""
    },
    "name": "host-run-ovs"
  },
  # ...
]
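To confirm these mounts on a given cluster, the JSON printed by that command can be checked programmatically. A sketch with hypothetical names (podVolume, missingHostPaths) that reads only the name and hostPath.path fields; the "# ..." line above only marks elided entries, and kubectl itself prints valid JSON.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// podVolume keeps only the fields this check reads.
type podVolume struct {
	Name     string `json:"name"`
	HostPath *struct {
		Path string `json:"path"`
	} `json:"hostPath"`
}

// missingHostPaths parses the JSON array of volumes and returns the
// required host paths that no hostPath volume mounts.
func missingHostPaths(data []byte, required ...string) ([]string, error) {
	var vols []podVolume
	if err := json.Unmarshal(data, &vols); err != nil {
		return nil, err
	}
	have := make(map[string]bool)
	for _, v := range vols {
		if v.HostPath != nil {
			have[v.HostPath.Path] = true
		}
	}
	var missing []string
	for _, p := range required {
		if !have[p] {
			missing = append(missing, p)
		}
	}
	return missing, nil
}

// sample mirrors the abridged output shown above.
const sample = `[
  {"hostPath": {"path": "/var/lib/kubelet/pods", "type": ""}, "name": "shared-dir"},
  {"hostPath": {"path": "/run/openvswitch", "type": ""}, "name": "host-run-ovs"}
]`

func main() {
	data := []byte(sample)
	if len(os.Args) > 1 { // optionally read saved kubectl output from a file
		b, err := os.ReadFile(os.Args[1])
		if err != nil {
			fmt.Println(err)
			return
		}
		data = b
	}
	missing, err := missingHostPaths(data, "/var/lib/kubelet/pods", "/run/openvswitch")
	if err != nil {
		fmt.Println("parse:", err)
		return
	}
	fmt.Println("missing hostPaths:", missing)
}
```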
After the target Pod starts, cni-server bind-mounts the Pod's vhostuser-sockets emptyDir directory onto /var/run/openvswitch/vhost_sockets/${pod-uid}/vhostuser-sockets:
https://github.com/kubeovn/kube-ovn/blob/v1.10.1/pkg/daemon/handler_linux.go#L21-L54
func createShortSharedDir(pod *v1.Pod, volumeName string) (err error) {
	// Find the named volume in the Pod spec; it must be an emptyDir.
	var volume *v1.Volume
	for index, v := range pod.Spec.Volumes {
		if v.Name == volumeName {
			volume = &pod.Spec.Volumes[index]
			break
		}
	}
	if volume == nil {
		return fmt.Errorf("can not found volume %s in pod %s", volumeName, pod.Name)
	}
	if volume.EmptyDir == nil {
		return fmt.Errorf("volume %s is not empty dir", volume.Name)
	}
	// Host directory that kubelet created for the emptyDir.
	originSharedDir := fmt.Sprintf("/var/lib/kubelet/pods/%s/volumes/kubernetes.io~empty-dir/%s", pod.UID, volumeName)
	// Shorter host path, e.g. /run/openvswitch/vhost_sockets/<uid>/<volume> as listed above.
	newSharedDir := getShortSharedDir(pod.UID, volumeName)
	// Create and mount only when the short path does not exist yet.
	if _, err = os.Stat(newSharedDir); os.IsNotExist(err) {
		err = os.MkdirAll(newSharedDir, 0750)
		if err != nil {
			return fmt.Errorf("createSharedDir: Failed to create dir (%s): %v", newSharedDir, err)
		}
		if strings.Contains(newSharedDir, util.DefaultHostVhostuserBaseDir) {
			klog.Infof("createSharedDir: Mount from %s to %s", originSharedDir, newSharedDir)
			// unix.Mount(source, target, ...): the emptyDir (source) is bind-mounted onto the short path (target).
			err = unix.Mount(originSharedDir, newSharedDir, "", unix.MS_BIND, "")
			if err != nil {
				return fmt.Errorf("createSharedDir: Failed to bind mount: %s", err)
			}
		}
		return nil
	}
	// Either the path already exists (err == nil) or Stat failed for another reason.
	return err
}
In this example, /var/lib/kubelet/pods/0087f7d0-3024-483b-b3a1-6eb44b24f341/volumes/kubernetes.io~empty-dir/vhostuser-sockets (the mount source) is bind-mounted onto /run/openvswitch/vhost_sockets/0087f7d0-3024-483b-b3a1-6eb44b24f341/vhostuser-sockets (the mount target), so both paths list the same socket:
$ ll /var/lib/kubelet/pods/0087f7d0-3024-483b-b3a1-6eb44b24f341/volumes/kubernetes.io~empty-dir/vhostuser-sockets
total 0
srwxrwxr-x 1 qemu qemu 0 Nov 29 17:21 pod59cc33fa39b
$ ll /run/openvswitch/vhost_sockets/0087f7d0-3024-483b-b3a1-6eb44b24f341/vhostuser-sockets
total 0
srwxrwxr-x 1 qemu qemu 0 Nov 29 17:21 pod59cc33fa39b
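A quick way to check that the bind mount is in place is to compare the two directories' device and inode numbers: a bind-mount target reports the same inode as its source, while an unmounted empty directory does not. A sketch using os.SameFile; sameDir is a hypothetical helper, and a match is only a sanity check, since /proc/self/mountinfo is the authoritative record of mounts.

```go
package main

import (
	"fmt"
	"os"
)

// sameDir reports whether a and b resolve to the same file (same device
// and inode), which holds for a bind-mount target and its source.
func sameDir(a, b string) (bool, error) {
	fa, err := os.Stat(a)
	if err != nil {
		return false, err
	}
	fb, err := os.Stat(b)
	if err != nil {
		return false, err
	}
	return os.SameFile(fa, fb), nil
}

func main() {
	// Host paths from the example above; pass two other paths to override.
	a := "/var/lib/kubelet/pods/0087f7d0-3024-483b-b3a1-6eb44b24f341/volumes/kubernetes.io~empty-dir/vhostuser-sockets"
	b := "/run/openvswitch/vhost_sockets/0087f7d0-3024-483b-b3a1-6eb44b24f341/vhostuser-sockets"
	if len(os.Args) == 3 {
		a, b = os.Args[1], os.Args[2]
	}
	ok, err := sameDir(a, b)
	if err != nil {
		fmt.Println("stat:", err)
		return
	}
	fmt.Println("same directory:", ok)
}
```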
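This bridges the inside and outside of the container: a socket created inside the container is visible to OVS on the host. When the Pod is deleted, its emptyDir is destroyed along with it, so no stale files are left behind in /var/run/openvswitch/vhostuser-sockets.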
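To hot-unplug the DPDK NIC, libvirtd calls DetachDevice. Once the NIC is detached, the corresponding usock under /var/run/openvswitch/vhostuser-sockets in the container is removed, and the controller daemon then calls the kube-ovn CNI DEL command to clean up.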
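The unplug ordering can be sketched as: wait for the usock to disappear, then run CNI DEL. waitForRemoval and unplug are hypothetical names; skipping CNI DEL when the socket lingers is a design assumption, not something this post prescribes, and cniDel would wrap the same kube-ovn invocation with CNI_COMMAND=DEL.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"time"
)

// waitForRemoval polls until path no longer exists or timeout elapses.
func waitForRemoval(path string, timeout, interval time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		_, err := os.Lstat(path)
		if errors.Is(err, os.ErrNotExist) {
			return nil
		}
		if err != nil {
			return err
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("%s still present after %v", path, timeout)
		}
		time.Sleep(interval)
	}
}

// unplug waits for the usock to go away after DetachDevice, then runs
// cniDel (a caller-supplied function, e.g. one issuing CNI_COMMAND=DEL).
func unplug(sockPath string, timeout time.Duration, cniDel func() error) error {
	if err := waitForRemoval(sockPath, timeout, 100*time.Millisecond); err != nil {
		return fmt.Errorf("detach not confirmed, skipping CNI DEL: %w", err)
	}
	return cniDel()
}

func main() {
	sock := "/var/run/openvswitch/vhostuser-sockets/pod59cc33fa39b"
	err := unplug(sock, 10*time.Second, func() error {
		fmt.Println("usock removed; calling CNI DEL") // placeholder for the real call
		return nil
	})
	if err != nil {
		fmt.Println(err)
	}
}
```

Waiting first avoids tearing down the OVS port while qemu may still hold the socket; a controller that prefers guaranteed cleanup could instead run DEL after the timeout as well.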