How KubeVirt Implements Data Volume Hotplug

Jul 6, 2023 22:00 · Kubernetes · Virtualization · Linux · KubeVirt

All PVCs used in this post are in Block volume mode, and the KubeVirt version is v0.51.0.

Usage

KubeVirt supports hot-plugging a PVC/PV as a data volume into a running instance (VMI).

  1. First, create a block-mode PVC

    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: datavol-hotplug
      namespace: demo
    spec:
      accessModes:
        - ReadWriteMany
      volumeMode: Block
      resources:
        requests:
          storage: 20Gi
    
  2. Hot-plug the PVC into the VMI with the virtctl tool (a client-go equivalent is sketched after this list)

    $ virtctl addvolume ecs-test1 --volume-name=datavol-hotplug -n demo --cache=none --persist
    Successfully submitted add volume request to VM ecs-test1 for volume datavol-hotplug
    
  3. Inspect the disks inside the ecs-test1 instance

    $ dmesg
    [  204.449808] scsi 0:0:0:0: Direct-Access     QEMU     QEMU HARDDISK    2.5+ PQ: 0 ANSI: 5
    [  204.467219] scsi 0:0:0:0: Attached scsi generic sg0 type 0
    [  204.476717] sd 0:0:0:0: Power-on or device reset occurred
    [  204.478632] sd 0:0:0:0: [sda] 41943040 512-byte logical blocks: (21.5 GB/20.0 GiB)
    [  204.480102] sd 0:0:0:0: [sda] Write Protect is off
    [  204.481953] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
    [  204.487884] sd 0:0:0:0: [sda] Attached SCSI disk
    
    $ lsblk
    NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
    sda      8:0    0   20G  0 disk
    vda    252:0    0   12G  0 disk
    ├─vda1 252:1    0  100M  0 part /boot/efi
    ├─vda2 252:2    0 1000M  0 part /boot
    ├─vda3 252:3    0    1M  0 part
    └─vda4 252:4    0 10.9G  0 part /
    vdb    252:16   0    1M  0 disk
    vdc    252:32   0   20G  0 disk
    

    sda is the data volume we just hot-plugged.
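
Under the hood, virtctl addvolume is just a thin client for the VM/VMI addvolume subresource API. A rough client-go equivalent is sketched below using the kubevirt.io/client-go kubecli package (a sketch only: the AddVolume helper does exist in client-go, but its exact signature varies across KubeVirt versions):

package main

import (
	k8sv1 "k8s.io/api/core/v1"
	v1 "kubevirt.io/api/core/v1"
	"kubevirt.io/client-go/kubecli"
)

func main() {
	// Build a KubeVirt client from the usual kubeconfig / in-cluster config.
	client, err := kubecli.GetKubevirtClient()
	if err != nil {
		panic(err)
	}

	// Equivalent of:
	//   virtctl addvolume ecs-test1 --volume-name=datavol-hotplug -n demo --cache=none --persist
	// --persist targets the VM object so the hot-plugged volume survives
	// restarts; without it, only the VMI is patched.
	err = client.VirtualMachine("demo").AddVolume("ecs-test1", &v1.AddVolumeOptions{
		Name: "datavol-hotplug",
		Disk: &v1.Disk{
			Name:       "datavol-hotplug",
			DiskDevice: v1.DiskDevice{Disk: &v1.DiskTarget{Bus: "scsi"}},
			Serial:     "datavol-hotplug",
			Cache:      v1.CacheNone,
		},
		VolumeSource: &v1.HotplugVolumeSource{
			PersistentVolumeClaim: &v1.PersistentVolumeClaimVolumeSource{
				PersistentVolumeClaimVolumeSource: k8sv1.PersistentVolumeClaimVolumeSource{
					ClaimName: "datavol-hotplug",
				},
				Hotpluggable: true,
			},
		},
	})
	if err != nil {
		panic(err)
	}
}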

As we all know, in the KubeVirt project the virtual machine instance (qemu-kvm) runs inside a Pod called virt-launcher:

$ kubectl get po -n demo
NAME                            READY   STATUS    RESTARTS   AGE
virt-launcher-ecs-test1-n4pl8   1/1     Running   0          6m21s

This post digs into how KubeVirt manages to hot-plug a block device into qemu-kvm without restarting that Pod.

How it works

Let's work backwards from the result. virt-launcher manages the qemu-kvm lifecycle through libvirt, so the hot-plugged volume is bound to appear in the libvirt domain definition:

$ kubectl exec -it virt-launcher-ecs-test1-n4pl8 -n demo -- virsh dumpxml 1
  <devices>
    <disk type='block' device='disk'>
      <driver name='qemu' type='raw' error_policy='stop' discard='unmap'/>
      <source dev='/var/run/kubevirt/hotplug-disks/datavol-hotplug' index='4'/>
      <backingStore/>
      <target dev='sda' bus='scsi'/>
      <serial>datavol-hotplug</serial>
      <alias name='ua-datavol-hotplug'/>
      <address type='drive' controller='0' bus='0' target='0' unit='0'/>
    </disk>
  </devices>

And correspondingly, the block device file for the hot-plugged volume inside the virt-launcher Pod:

$ kubectl exec -it virt-launcher-ecs-test1-n4pl8 -n demo -- ls -al /var/run/kubevirt/hotplug-disks/datavol-hotplug
brw-rw---- 1 qemu qemu 252, 240 Jul  4 07:29 /var/run/kubevirt/hotplug-disks/datavol-hotplug

Because the Pod hosting qemu-kvm must not be restarted while a disk is hot-plugged, KubeVirt does not update the virt-launcher Pod definition:

$ kubectl get po virt-launcher-ecs-test1-n4pl8 -n demo -o yaml | grep datavol-hotplug

But both the VirtualMachine and VirtualMachineInstance objects are updated, with the data volume datavol-hotplug appended to their Volumes lists:

$ kubectl get vm ecs-test1 -n demo -ojsonpath='{.spec.template.spec.volumes}' | jq
[
  {
    "name": "bootdisk",
    "persistentVolumeClaim": {
      "claimName": "ecs-test1-bootpvc-bgzrlh"
    }
  },
  {
    "cloudInitNoCloud": {
      "userData": "#cloud-config\npassword: atomic\nssh_pwauth: True\nchpasswd: { expire: False }\nbootcmd:\n- \"setenforce 0\""
    },
    "name": "cloudinitdisk"
  },
  {
    "name": "datavol-hotplug",
    "persistentVolumeClaim": {
      "claimName": "datavol-hotplug",
      "hotpluggable": true
    }
  }
]


$ kubectl get vmi ecs-test1 -n demo -ojsonpath='{.spec.volumes}' | jq
[
  {
    "name": "bootdisk",
    "persistentVolumeClaim": {
      "claimName": "ecs-test1-bootpvc-bgzrlh"
    }
  },
  {
    "cloudInitNoCloud": {
      "userData": "#cloud-config\npassword: atomic\nssh_pwauth: True\nchpasswd: { expire: False }\nbootcmd:\n- \"setenforce 0\""
    },
    "name": "cloudinitdisk"
  },
  {
    "name": "datavol-hotplug",
    "persistentVolumeClaim": {
      "claimName": "datavol-hotplug",
      "hotpluggable": true
    }
  }
]

A new Pod shows up in the same namespace as the virt-launcher Pod (created by virt-controller once it observes the data volume appended to the VMI):

$ kubectl get po -n demo
NAME                            READY   STATUS    RESTARTS   AGE
hp-volume-kdpgl                 1/1     Running   0          17m
virt-launcher-ecs-test1-n4pl8   1/1     Running   0          20m

The hp-volume Pod is defined as:

$ kubectl get po hp-volume-kdpgl -n demo -o yaml
apiVersion: v1
kind: Pod
metadata:
  ownerReferences:
  - apiVersion: v1
    blockOwnerDeletion: true
    controller: true
    kind: Pod
    name: virt-launcher-ecs-test1-n4pl8
    uid: 7bf1ee01-f81f-4406-a500-b351a6115dd9
  resourceVersion: "70797056"
  uid: 90f44088-5dd0-4d4d-9ca4-d9fee4ef736a
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/hostname
            operator: In
            values:
            - bj-node171
  containers:
  - command:
    - /bin/sh
    - -c
    - /usr/bin/container-disk --copy-path /path/hp
    image: registry-1.ict-mec.net:18443/kubevirt/virt-launcher:v0.51.0-dev
    name: hotplug-disk
    volumeDevices:
    - devicePath: /path/datavol-hotplug/368836ba-6f6a-4d07-b7ad-914b856642f6
      name: datavol-hotplug
    volumeMounts:
    - mountPath: /path
      mountPropagation: HostToContainer
      name: hotplug-disks
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-drrz6
      readOnly: true
  volumes:
  - emptyDir: {}
    name: hotplug-disks
  - name: datavol-hotplug
    persistentVolumeClaim:
      claimName: datavol-hotplug

  1. The hp-volume Pod consumes the hotplug PVC datavol-hotplug; once the Pod is scheduled, the PV backing that PVC is attached to the host the Pod lands on.
  2. Because the hotplug PV must be attached to wherever the VMI lives, i.e. the node running the virt-launcher Pod, the hp-volume Pod carries a Node Affinity pinning it to that node.
  3. The hp-volume Pod's ownerReferences is set to the virt-launcher Pod, so when the instance is shut down or deleted and virt-controller removes the virt-launcher Pod, this Pod is deleted along with it (a sketch of that ownerReference is shown after this list).
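
For illustration, wiring up that ownerReference in Go would look roughly like this (a hypothetical helper, not the actual template code in virt-controller's templateService):

import (
	k8sv1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// ownedByLauncher builds an OwnerReference that makes the hp-volume Pod a
// child of the virt-launcher Pod, so Kubernetes garbage-collects it as soon
// as the launcher Pod disappears.
func ownedByLauncher(launcher *k8sv1.Pod) metav1.OwnerReference {
	controller := true
	return metav1.OwnerReference{
		APIVersion:         "v1",
		Kind:               "Pod",
		Name:               launcher.Name,
		UID:                launcher.UID,
		Controller:         &controller,
		BlockOwnerDeletion: &controller,
	}
}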

For the KubeVirt architecture, see https://blog.crazytaxii.com/posts/kubevirt_intro/ and https://github.com/kubevirt/kubevirt/blob/v0.51.0/docs/components.md

virt-controller

The hp-volume Pod is created by the virt-controller once it observes a data volume being appended to the VMI:

https://github.com/kubevirt/kubevirt/blob/ef55f743e394c34ec7e239b8dc361f679ec41738/pkg/virt-controller/watch/vmi.go#L1903-L1934

func (c *VMIController) createAttachmentPodTemplate(vmi *virtv1.VirtualMachineInstance, virtlauncherPod *k8sv1.Pod, volumes []*virtv1.Volume) (*k8sv1.Pod, error) {
    logger := log.Log.Object(vmi)
    var pod *k8sv1.Pod
    var err error

    volumeNamesPVCMap, err := kubevirttypes.VirtVolumesToPVCMap(volumes, c.pvcInformer.GetStore(), virtlauncherPod.Namespace)
    if err != nil {
        return nil, fmt.Errorf("failed to get PVC map: %v", err)
    }
    // a lot of code here
    if len(volumeNamesPVCMap) > 0 {
        pod, err = c.templateService.RenderHotplugAttachmentPodTemplate(volumes, virtlauncherPod, vmi, volumeNamesPVCMap, false)
    }
    return pod, err
}

It diffs the volumes to find the PVCs that need hot-plugging (here, datavol-hotplug), then renders the hp-volume Pod definition and creates it:

https://github.com/kubevirt/kubevirt/blob/ef55f743e394c34ec7e239b8dc361f679ec41738/pkg/virt-controller/watch/vmi.go#L1756-L1772

func (c *VMIController) createAttachmentPod(vmi *virtv1.VirtualMachineInstance, virtLauncherPod *k8sv1.Pod, volumes []*virtv1.Volume) syncError {
    attachmentPodTemplate, _ := c.createAttachmentPodTemplate(vmi, virtLauncherPod, volumes)
    if attachmentPodTemplate == nil {
        return nil
    }
    vmiKey := controller.VirtualMachineInstanceKey(vmi)
    c.podExpectations.ExpectCreations(vmiKey, 1)

    pod, err := c.clientset.CoreV1().Pods(vmi.GetNamespace()).Create(context.Background(), attachmentPodTemplate, v1.CreateOptions{})
    if err != nil {
        c.podExpectations.CreationObserved(vmiKey)
        c.recorder.Eventf(vmi, k8sv1.EventTypeWarning, FailedCreatePodReason, "Error creating attachment pod: %v", err)
        return &syncErrorImpl{fmt.Errorf("Error creating attachment pod %v", err), FailedCreatePodReason}
    }
    c.recorder.Eventf(vmi, k8sv1.EventTypeNormal, SuccessfulCreatePodReason, "Created attachment pod %s", pod.Name)
    return nil
}

For all the related code, see https://github.com/kubevirt/kubevirt/blob/ef55f743e394c34ec7e239b8dc361f679ec41738/pkg/virt-controller/watch/vmi.go

Our focus is on where the block device file backing the hot-plugged volume in the virt-launcher Pod comes from, i.e. figuring out how that device gets into the virt-launcher Pod.

Back to the hp-volume Pod: it runs a container-disk process, an IPC service that creates a Unix socket:

$ kubectl exec -it hp-volume-kdpgl -n demo -- ls -al /path
total 0
drwxrwxrwx 3 root root 44 Jul  6 03:06 .
drwxr-xr-x 1 root root 63 Jul  6 03:06 ..
drwxr-xr-x 2 root root 50 Jul  6 03:06 datavol-hotplug
srwxr-xr-x 1 root root  0 Jul  6 03:06 hp.sock

Since the /path directory in the hp-volume Pod is an EmptyDir, the usock is also reachable from the host, at a path assembled by a fixed rule:

$ ls -al /var/lib/kubelet/pods/$(kubectl get po hp-volume-kdpgl -n demo -o jsonpath='{.metadata.uid}')/volumes/kubernetes.io~empty-dir/hotplug-disks/hp.sock
srwxr-xr-x 1 root root 0 Jul  6 11:06 /var/lib/kubelet/pods/90f44088-5dd0-4d4d-9ca4-d9fee4ef736a/volumes/kubernetes.io~empty-dir/hotplug-disks/hp.sock

virt-handler

The virt-handler component is essentially a controller as well, but its Pod runs with high privileges (privileged). When it observes an update to a VirtualMachineInstance object, it handles the hot-plugged disks:

https://github.com/kubevirt/kubevirt/blob/912f2f861c86c6c758a1fc8b2d9f57f6d5b534b5/pkg/virt-handler/vm.go

func (d *VirtualMachineController) processVmUpdate(vmi *v1.VirtualMachineInstance, domainExists bool) error {
    isUnresponsive, isInitialized, err := d.isLauncherClientUnresponsive(vmi)
    if err != nil {
        return err
    }
    if !isInitialized {
        d.Queue.AddAfter(controller.VirtualMachineInstanceKey(vmi), time.Second*1)
        return nil
    } else if isUnresponsive {
        return goerror.New(fmt.Sprintf("Can not update a VirtualMachineInstance with unresponsive command server."))
    }

    d.handlePostMigrationProxyCleanup(vmi)

    if d.isPreMigrationTarget(vmi) {
        return d.vmUpdateHelperMigrationTarget(vmi)
    } else if d.isMigrationSource(vmi) {
        return d.vmUpdateHelperMigrationSource(vmi)
    } else {
        return d.vmUpdateHelperDefault(vmi, domainExists)
    }
}

func (d *VirtualMachineController) vmUpdateHelperDefault(origVMI *v1.VirtualMachineInstance, domainExists bool) error {
    // a lot of code here
        // Try to mount hotplug volume if there is any during startup.
        if err := d.hotplugVolumeMounter.Mount(vmi); err != nil {
            return err
        }
}

https://github.com/kubevirt/kubevirt/blob/74d3d63102736d809bb4a2707a5e9918ab560882/pkg/virt-handler/hotplug-disk/mount.go#L261-L277

func (m *volumeMounter) Mount(vmi *v1.VirtualMachineInstance) error {
    record, err := m.getMountTargetRecord(vmi)
    if err != nil {
        return err
    }
    for _, volumeStatus := range vmi.Status.VolumeStatus {
        if volumeStatus.HotplugVolume == nil {
            // Skip non hotplug volumes
            continue
        }
        sourceUID := volumeStatus.HotplugVolume.AttachPodUID
        if err := m.mountHotplugVolume(vmi, volumeStatus.Name, sourceUID, record); err != nil {
            return err
        }
    }
    return nil
}

func (m *volumeMounter) mountHotplugVolume(vmi *v1.VirtualMachineInstance, volumeName string, sourceUID types.UID, record *vmiMountTargetRecord) error {
    logger := log.DefaultLogger()
    logger.V(4).Infof("Hotplug check volume name: %s", volumeName)
    if sourceUID != types.UID("") {
        if m.isBlockVolume(&vmi.Status, volumeName) {
            logger.V(4).Infof("Mounting block volume: %s", volumeName)
            if err := m.mountBlockHotplugVolume(vmi, volumeName, sourceUID, record); err != nil {
                return err
            }
        }
    }
    return nil
}

It takes the attachment Pod (i.e. the hp-volume Pod) from the VMI's status and calls mountHotplugVolume, which for block volumes dispatches to mountBlockHotplugVolume:

https://github.com/kubevirt/kubevirt/blob/74d3d63102736d809bb4a2707a5e9918ab560882/pkg/virt-handler/hotplug-disk/mount.go#L308-L354

func (m *volumeMounter) mountBlockHotplugVolume(vmi *v1.VirtualMachineInstance, volume string, sourceUID types.UID, record *vmiMountTargetRecord) error {
    virtlauncherUID := m.findVirtlauncherUID(vmi)
    if virtlauncherUID == "" {
        // This is not the node the pod is running on.
        return nil
    }
    targetPath, err := m.hotplugDiskManager.GetHotplugTargetPodPathOnHost(virtlauncherUID)
    if err != nil {
        return err
    }
    // a lot of code here
    if isBlockExists, _ := isBlockDevice(deviceName); !isBlockExists {
        sourceMajor, sourceMinor, permissions, err := m.getSourceMajorMinor(sourceUID, volume)
        if err != nil {
            return err
        }
        if err := m.writePathToMountRecord(deviceName, vmi, record); err != nil {
            return err
        }
        // allow block devices
        if err := m.allowBlockMajorMinor(sourceMajor, sourceMinor, cgroupsManager); err != nil {
            return err
        }
        if _, err = m.createBlockDeviceFile(deviceName, sourceMajor, sourceMinor, permissions); err != nil {
            return err
        }
    }
}
  1. GetHotplugTargetPodPathOnHost assembles the host-side path at which the block device file needed by the hot-plugged data volume in the virt-launcher Pod will live
  2. getSourceMajorMinor reads the major and minor numbers of the block device file (datavol-hotplug) inside the hp-volume Pod
  3. allowBlockMajorMinor whitelists that major/minor pair in the virt-launcher container's devices cgroup so that qemu-kvm is allowed to open the device (a conceptual sketch follows this list)
  4. createBlockDeviceFile creates the block device file /path/to/datavol-hotplug in the virt-launcher mount namespace
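
The allowBlockMajorMinor step deserves a note: a container's devices cgroup normally forbids access to arbitrary block devices, so the major/minor pair has to be whitelisted before qemu-kvm may open the new node. On cgroup v1 this conceptually boils down to the sketch below (an illustration with a hypothetical helper; KubeVirt actually goes through a cgroups manager and handles cgroup v2 differently):

import (
	"fmt"
	"os"
	"path/filepath"
)

// allowDevice appends a device rule like "b 252:240 rwm" to the container's
// devices.allow file, permitting read/write/mknod on that block device.
func allowDevice(cgroupPath string, major, minor int64) error {
	rule := fmt.Sprintf("b %d:%d rwm", major, minor)
	return os.WriteFile(filepath.Join(cgroupPath, "devices.allow"), []byte(rule), 0o200)
}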

Let's take a closer look at the createBlockDeviceFile method:

https://github.com/kubevirt/kubevirt/blob/74d3d63102736d809bb4a2707a5e9918ab560882/pkg/virt-handler/hotplug-disk/mount.go#L489-L502

var mknodCommand = func(deviceName string, major, minor int64, blockDevicePermissions string) ([]byte, error) {
    output, err := exec.Command("/usr/bin/mknod", "--mode", fmt.Sprintf("0%s", blockDevicePermissions), deviceName, "b", strconv.FormatInt(major, 10), strconv.FormatInt(minor, 10)).CombinedOutput()
    log.Log.V(3).Infof("running mknod. err: %v, output: %s", err, string(output))
    return output, err
}

func (m *volumeMounter) createBlockDeviceFile(deviceName string, major, minor int64, blockDevicePermissions string) (string, error) {
    exists, err := diskutils.FileExists(deviceName)
    if err != nil {
        return "", err
    }
    if !exists {
        out, err := mknodCommand(deviceName, major, minor, blockDevicePermissions)
        if err != nil {
            log.DefaultLogger().Errorf("Error creating block device file: %s, %v", out, err)
            return "", err
        }
    }
    return deviceName, nil
}

It invokes mknod with the major & minor numbers as arguments to create a block device file; that device is in fact the rbd disk that was attached to the host along with the hp-volume Pod.
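
The same node could be created without shelling out, straight through the mknod(2) syscall; a minimal sketch with golang.org/x/sys/unix (not what KubeVirt does; it execs /usr/bin/mknod as shown above):

import "golang.org/x/sys/unix"

// createBlockNode is the syscall-level equivalent of
// `mknod --mode 0660 path b major minor`.
func createBlockNode(path string, major, minor uint32) error {
	dev := unix.Mkdev(major, minor)
	return unix.Mknod(path, unix.S_IFBLK|0o660, int(dev))
}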

For how Kubernetes block-mode PVs work, see https://blog.crazytaxii.com/posts/k8s_pv_block_mode/

So the rbd device on the host (/dev/rbd32), the block device file in the hp-volume Pod, and the block device file in virt-launcher all point at the same rbd volume. In theory, a block-mode hot-plugged data volume in KubeVirt therefore performs just like a data volume defined at instance creation time.
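
One way to convince yourself of this is to compare device numbers: stat the host rbd device, the file in the hp-volume Pod, and the file in the virt-launcher Pod, and all three should report the same major:minor pair (a verification sketch; the paths are the ones from this post):

import (
	"fmt"

	"golang.org/x/sys/unix"
)

// printDevNumbers prints the major:minor pair of a block device file, e.g.
// /dev/rbd32 on the host or /var/run/kubevirt/hotplug-disks/datavol-hotplug
// inside the Pods (run it there via kubectl exec or nsenter).
func printDevNumbers(path string) error {
	var st unix.Stat_t
	if err := unix.Stat(path, &st); err != nil {
		return err
	}
	rdev := uint64(st.Rdev)
	fmt.Printf("%s -> %d:%d\n", path, unix.Major(rdev), unix.Minor(rdev))
	return nil
}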

Unix socket

Finally, let's look into what the usock created by the container-disk app in the hp-volume Pod is actually for:

$ kubectl exec -it hp-volume-kdpgl -n demo -- ls -al /path
total 0
drwxr-xr-x 2 root root 50 Jul  6 03:06 datavol-hotplug
srwxr-xr-x 1 root root  0 Jul  6 03:06 hp.sock

In the container-disk implementation, this usock never serves any requests; its real purpose shows up in virt-handler:

https://github.com/kubevirt/kubevirt/blob/74d3d63102736d809bb4a2707a5e9918ab560882/pkg/virt-handler/hotplug-disk/mount.go#L557-594

var socketPath = func(podUID types.UID) string {
    return fmt.Sprintf("pods/%s/volumes/kubernetes.io~empty-dir/hotplug-disks/hp.sock", string(podUID))
}

func (m *volumeMounter) getSourcePodFilePath(sourceUID types.UID, vmi *v1.VirtualMachineInstance, volume string) (string, error) {
    iso := isolationDetector("/path")
    isoRes, err := iso.DetectForSocket(vmi, socketPath(sourceUID))
    if err != nil {
        return "", err
    }
    findmounts, err := LookupFindmntInfoByVolume(volume, isoRes.Pid())
    if err != nil {
        return "", err
    }
    // a lot of code here
    // Did not find the disk image file, return error
    return "", fmt.Errorf("unable to find source disk image path for pod %s", sourceUID)
}

Jumping into the DetectForSocket method:

https://github.com/kubevirt/kubevirt/blob/74d3d63102736d809bb4a2707a5e9918ab560882/pkg/virt-handler/isolation/detector.go#L86-L102

func (s *socketBasedIsolationDetector) DetectForSocket(vm *v1.VirtualMachineInstance, socket string) (IsolationResult, error) {
    var pid int
    var ppid int
    var err error

    if pid, err = s.getPid(socket); err != nil {
        log.Log.Object(vm).Reason(err).Errorf("Could not get owner Pid of socket %s", socket)
        return nil, err
    }

    // a lot of code here
}

It uses the Unix socket to obtain the Pid of the socket's owner process (i.e. container-disk):

https://github.com/kubevirt/kubevirt/blob/74d3d63102736d809bb4a2707a5e9918ab560882/pkg/virt-handler/isolation/detector.go#L227-L249

func (s *socketBasedIsolationDetector) getPid(socket string) (int, error) {
    sock, err := net.DialTimeout("unix", socket, time.Duration(isolationDialTimeout)*time.Second)
    if err != nil {
        return -1, err
    }
    defer sock.Close()

    ufile, err := sock.(*net.UnixConn).File()
    if err != nil {
        return -1, err
    }
    // This is the tricky part, which will give us the PID of the owning socket
    ucreds, err := syscall.GetsockoptUcred(int(ufile.Fd()), syscall.SOL_SOCKET, syscall.SO_PEERCRED)
    if err != nil {
        return -1, err
    }

    if int(ucreds.Pid) == 0 {
        return -1, fmt.Errorf("the detected PID is 0. Is the isolation detector running in the host PID namespace?")
    }

    return int(ucreds.Pid), nil
}

Summary

KubeVirt uses an intermediate Pod (hp-volume) to drive CSI into attaching the PV to the host where the virt-launcher Pod runs, and the privileged virt-handler then creates a block device file pointing at the rbd block device inside the mount namespace of qemu-kvm. Without restarting the virt-launcher Pod, this preserves as much of the data volume's performance as possible for qemu-kvm. A very clever design.