Hot-plugging DPDK NICs into containers

Dec 1, 2023 16:30 · Container Linux Kubernetes

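This came out of a business requirement: a controller daemon, deployed as a DaemonSet, hot-plugs vhost-user unix domain sockets into a target container, where the qemu process in that container uses them as DPDK NICs. The CNI is kube-ovn.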

Any Pod that will use a DPDK NIC must be created with an emptyDir volume named vhostuser-sockets mounted into it:

apiVersion: v1
kind: Pod
metadata:
    # ...
spec:
  containers:
  - name: compute
    # ...
    volumeMounts:
    - mountPath: /var/run/openvswitch/vhostuser-sockets
      mountPropagation: Bidirectional
      name: vhostuser-sockets
  volumes:
  - emptyDir: {}
    name: vhostuser-sockets

We have already implemented the controller daemon so that it invokes the kube-ovn CNI ADD/DEL commands to create or delete the DPDK NIC in the container:

$ CNI_COMMAND=ADD \
  CNI_CONTAINERID=5fabffc727432dac08fc03d974dbf9e2aa14e8963a65e30e67e5ebc587b2370a \
  CNI_NETNS=/proc/637226/ns/net \
  CNI_PATH=/opt/cni/bin/kube-ovn \
  CNI_IFNAME=pod59cc33fa39b \
  CNI_ARGS="K8S_POD_NAME=virt-launcher-ecs-test7-qt99w;K8S_POD_NAMESPACE=default" \
  /opt/cni/bin/kube-ovn < /etc/cni/net.d/01-kube-ovn.conflist
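
The invocation above can be driven from a Go daemon roughly as follows. This is only a minimal sketch, not the controller's actual code: the cniRequest type, its field names, and the invoke helper are hypothetical, and the CNI_PATH value is simply copied from the command above.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// cniRequest holds the per-call values seen in the command above.
// The type and its field names are hypothetical.
type cniRequest struct {
	Command     string // "ADD" or "DEL"
	ContainerID string
	NetNS       string
	IfName      string
	PodName     string
	Namespace   string
}

// env renders the request as CNI_* environment variables.
func (r cniRequest) env() []string {
	return []string{
		"CNI_COMMAND=" + r.Command,
		"CNI_CONTAINERID=" + r.ContainerID,
		"CNI_NETNS=" + r.NetNS,
		"CNI_PATH=/opt/cni/bin/kube-ovn", // copied from the command above
		"CNI_IFNAME=" + r.IfName,
		"CNI_ARGS=K8S_POD_NAME=" + r.PodName + ";K8S_POD_NAMESPACE=" + r.Namespace,
	}
}

// invoke runs the plugin binary with the network configuration file on
// stdin and returns whatever the plugin writes to stdout.
func invoke(r cniRequest, plugin, confPath string) ([]byte, error) {
	conf, err := os.Open(confPath)
	if err != nil {
		return nil, err
	}
	defer conf.Close()

	cmd := exec.Command(plugin)
	cmd.Env = append(os.Environ(), r.env()...)
	cmd.Stdin = conf
	cmd.Stderr = os.Stderr
	return cmd.Output()
}

func main() {
	req := cniRequest{
		Command:     "ADD",
		ContainerID: "5fabffc727432dac08fc03d974dbf9e2aa14e8963a65e30e67e5ebc587b2370a",
		NetNS:       "/proc/637226/ns/net",
		IfName:      "pod59cc33fa39b",
		PodName:     "virt-launcher-ecs-test7-qt99w",
		Namespace:   "default",
	}
	// Only print the environment here; on a node with kube-ovn installed
	// one would call:
	//   invoke(req, "/opt/cni/bin/kube-ovn", "/etc/cni/net.d/01-kube-ovn.conflist")
	for _, kv := range req.env() {
		fmt.Println(kv)
	}
}
```

Switching Command to "DEL" with the same remaining fields gives the matching teardown call.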

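qemu can use a DPDK NIC (a vhost-user interface) in two modes, client and server. We currently use server mode, in which qemu creates the socket and listens on it: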

    <interface type='vhostuser'>
      <mac address='00:00:00:24:65:f4'/>
      <source type='unix' path='/var/run/openvswitch/vhostuser-sockets/pod59cc33fa39b' mode='server'/>
      <model type='virtio-non-transitional'/>
      <driver name='vhost' queues='4'/>
      <alias name='ua-np-test-dpdk'/>
      <address type='pci' domain='0x0000' bus='0x02' slot='0x00' function='0x0'/>
    </interface>

After libvirtd in the container calls AttachDevice to attach the NIC, a new unix domain socket appears under /var/run/openvswitch/vhostuser-sockets in the container:

$ kubectl exec -it virt-launcher-ecs-test7-qt99w -- ls -al /var/run/openvswitch/vhostuser-sockets
total 0
drwxrwsrwx 2 root qemu 50 Nov 29 09:21 .
drwxr-xr-x 3 root root 31 Nov 24 07:20 ..
srwxrwxr-x 1 qemu qemu  0 Nov 29 09:21 pod59cc33fa39b

The same socket also shows up on the host; its full path has to be assembled from the Pod UID:

$ kubectl get po virt-launcher-ecs-test7-qt99w -o jsonpath='{.metadata.uid}'
0087f7d0-3024-483b-b3a1-6eb44b24f341

$ ls /var/run/openvswitch/vhost_sockets/0087f7d0-3024-483b-b3a1-6eb44b24f341/vhostuser-sockets/
pod17274e5ba35  pod59cc33fa39b
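
Assembling that path in code is a plain join of the base directory, the Pod UID, the volume name and the interface name. A minimal sketch, assuming the base directory shown in the listing above (the constant and function names are hypothetical; on most distributions /var/run is a symlink to /run, so either prefix reaches the same directory):

```go
package main

import (
	"fmt"
	"path/filepath"
)

// hostVhostuserBaseDir is the host directory from the listing above.
// The constant name is hypothetical.
const hostVhostuserBaseDir = "/var/run/openvswitch/vhost_sockets"

// hostSocketPath returns the host-side path of a vhost-user socket that a
// Pod created inside its vhostuser-sockets volume.
func hostSocketPath(podUID, volumeName, ifName string) string {
	return filepath.Join(hostVhostuserBaseDir, podUID, volumeName, ifName)
}

func main() {
	fmt.Println(hostSocketPath(
		"0087f7d0-3024-483b-b3a1-6eb44b24f341",
		"vhostuser-sockets",
		"pod59cc33fa39b",
	))
}
```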

The kube-ovn cni-server container already mounts the host paths /var/lib/kubelet/pods and /run/openvswitch as hostPath volumes:

$ kubectl get ds kube-ovn-cni -n kube-system -o jsonpath='{.spec.template.spec.volumes}' | jq
[
  {
    "hostPath": {
      "path": "/var/lib/kubelet/pods",
      "type": ""
    },
    "name": "shared-dir"
  },
  {
    "hostPath": {
      "path": "/run/openvswitch",
      "type": ""
    },
    "name": "host-run-ovs"
  },
  # ...
]

After the target Pod starts, cni-server bind mounts the Pod's vhostuser-sockets emptyDir directory onto /var/run/openvswitch/vhost_sockets/${pod-uid}/vhostuser-sockets:

https://github.com/kubeovn/kube-ovn/blob/v1.10.1/pkg/daemon/handler_linux.go#L21-L54

func createShortSharedDir(pod *v1.Pod, volumeName string) (err error) {
    var volume *v1.Volume
    for index, v := range pod.Spec.Volumes {
        if v.Name == volumeName {
            volume = &pod.Spec.Volumes[index]
            break
        }
    }
    if volume == nil {
        return fmt.Errorf("can not found volume %s in pod %s", volumeName, pod.Name)
    }
    if volume.EmptyDir == nil {
        return fmt.Errorf("volume %s is not empty dir", volume.Name)
    }
    originSharedDir := fmt.Sprintf("/var/lib/kubelet/pods/%s/volumes/kubernetes.io~empty-dir/%s", pod.UID, volumeName)
    newSharedDir := getShortSharedDir(pod.UID, volumeName)
    if _, err = os.Stat(newSharedDir); os.IsNotExist(err) {
        err = os.MkdirAll(newSharedDir, 0750)
        if err != nil {
            return fmt.Errorf("createSharedDir: Failed to create dir (%s): %v", newSharedDir, err)
        }

        if strings.Contains(newSharedDir, util.DefaultHostVhostuserBaseDir) {
            klog.Infof("createSharedDir: Mount from %s to %s", originSharedDir, newSharedDir)
            err = unix.Mount(originSharedDir, newSharedDir, "", unix.MS_BIND, "")
            if err != nil {
                return fmt.Errorf("createSharedDir: Failed to bind mount: %s", err)
            }
        }
        return nil

    }
    return err
}

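In this example the emptyDir directory /var/lib/kubelet/pods/0087f7d0-3024-483b-b3a1-6eb44b24f341/volumes/kubernetes.io~empty-dir/vhostuser-sockets is the source of the bind mount and /run/openvswitch/vhost_sockets/0087f7d0-3024-483b-b3a1-6eb44b24f341/vhostuser-sockets is the target, so both paths list the same socket: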

$ ll /var/lib/kubelet/pods/0087f7d0-3024-483b-b3a1-6eb44b24f341/volumes/kubernetes.io~empty-dir/vhostuser-sockets
total 0
srwxrwxr-x 1 qemu qemu 0 Nov 29 17:21 pod59cc33fa39b

$ ll /run/openvswitch/vhost_sockets/0087f7d0-3024-483b-b3a1-6eb44b24f341/vhostuser-sockets
total 0
srwxrwxr-x 1 qemu qemu 0 Nov 29 17:21 pod59cc33fa39b
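
One way to confirm that the two listings really show the same socket, rather than two files that happen to share a name, is to compare the device and inode behind each path. The sketch below does that with os.SameFile; the sameFile helper is hypothetical and is not part of kube-ovn.

```go
package main

import (
	"fmt"
	"os"
)

// sameFile reports whether two paths resolve to the same underlying file.
// os.Stat only inspects metadata, so it is safe to use on a unix socket.
func sameFile(a, b string) (bool, error) {
	fa, err := os.Stat(a)
	if err != nil {
		return false, err
	}
	fb, err := os.Stat(b)
	if err != nil {
		return false, err
	}
	return os.SameFile(fa, fb), nil
}

func main() {
	const uid = "0087f7d0-3024-483b-b3a1-6eb44b24f341"
	kubeletPath := "/var/lib/kubelet/pods/" + uid +
		"/volumes/kubernetes.io~empty-dir/vhostuser-sockets/pod59cc33fa39b"
	ovsPath := "/run/openvswitch/vhost_sockets/" + uid +
		"/vhostuser-sockets/pod59cc33fa39b"

	ok, err := sameFile(kubeletPath, ovsPath)
	if err != nil {
		fmt.Println("stat failed:", err)
		return
	}
	fmt.Println("same file:", ok)
}
```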

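This connects the inside of the container with the outside: files created inside the container are visible to ovs on the host. When the Pod is deleted, the emptyDir is destroyed with it, so nothing is left behind from the container's /var/run/openvswitch/vhostuser-sockets.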

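To hot-unplug a DPDK NIC, libvirtd calls DetachDevice to detach the NIC, after which the corresponding socket is removed from /var/run/openvswitch/vhostuser-sockets in the container. The controller daemon then invokes the kube-ovn CNI DEL command to clean up.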
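
If the daemon should only issue CNI DEL once the socket has actually disappeared, it could poll the host-side path first. The sketch below is one possible approach, not necessarily what the controller does: the waitRemoved helper, the timeout and the polling interval are all assumptions.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
	"time"
)

// waitRemoved polls until path no longer exists, and returns an error if
// it is still present once timeout has elapsed.
func waitRemoved(path string, timeout, interval time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		_, err := os.Lstat(path)
		if errors.Is(err, fs.ErrNotExist) {
			return nil // the socket is gone
		}
		if err != nil {
			return err // e.g. permission denied on a parent directory
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("%s still exists after %s", path, timeout)
		}
		time.Sleep(interval)
	}
}

func main() {
	sock := "/var/run/openvswitch/vhost_sockets/" +
		"0087f7d0-3024-483b-b3a1-6eb44b24f341/vhostuser-sockets/pod59cc33fa39b"

	// Timeout and interval are illustrative values.
	if err := waitRemoved(sock, 5*time.Second, 100*time.Millisecond); err != nil {
		fmt.Println("not issuing CNI DEL:", err)
		return
	}
	fmt.Println("socket removed, CNI DEL can be issued")
}
```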