How KubeVirt Implements Data Volume Hotplug
Jul 6, 2023 22:00
All PVCs used in this post are in Block mode; the KubeVirt version is v0.51.0.
Usage
KubeVirt supports hot-plugging a PVC/PV as a data volume into a running instance (VMI).
- First, create a block-mode PVC:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: datavol-hotplug
  namespace: demo
spec:
  accessModes:
  - ReadWriteMany
  volumeMode: Block
  resources:
    requests:
      storage: 20Gi
- Hot-plug it into the running instance with virtctl:
$ virtctl addvolume ecs-test1 --volume-name=datavol-hotplug -n demo --cache=none --persist
Successfully submitted add volume request to VM ecs-test1 for volume datavol-hotplug
- Check the disks inside the ecs-test1 instance:
[  204.449808] scsi 0:0:0:0: Direct-Access     QEMU     QEMU HARDDISK    2.5+ PQ: 0 ANSI: 5
[  204.467219] scsi 0:0:0:0: Attached scsi generic sg0 type 0
[  204.476717] sd 0:0:0:0: Power-on or device reset occurred
[  204.478632] sd 0:0:0:0: [sda] 41943040 512-byte logical blocks: (21.5 GB/20.0 GiB)
[  204.480102] sd 0:0:0:0: [sda] Write Protect is off
[  204.481953] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[  204.487884] sd 0:0:0:0: [sda] Attached SCSI disk

$ lsblk
NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sda      8:0    0   20G  0 disk
vda    252:0    0   12G  0 disk
├─vda1 252:1    0  100M  0 part /boot/efi
├─vda2 252:2    0 1000M  0 part /boot
├─vda3 252:3    0    1M  0 part
└─vda4 252:4    0 10.9G  0 part /
vdb    252:16   0    1M  0 disk
vdc    252:32   0   20G  0 disk
sda is the data volume we just hot-plugged.
As you probably know, in KubeVirt the virtual machine instance (qemu-kvm) runs inside a Pod named virt-launcher:
$ kubectl get po -n demo
NAME READY STATUS RESTARTS AGE
virt-launcher-ecs-test1-n4pl8 1/1 Running 0 6m21s
This article digs into how KubeVirt manages to hot-plug a block device into qemu-kvm without restarting that Pod.
How it works
Let's trace backwards from the result. virt-launcher manages the qemu-kvm lifecycle through libvirt, so the hot-plugged volume must appear in the libvirt domain definition:
$ kubectl exec -it virt-launcher-ecs-test1-n4pl8 -n demo -- virsh dumpxml 1
<devices>
<disk type='block' device='disk'>
<driver name='qemu' type='raw' error_policy='stop' discard='unmap'/>
<source dev='/var/run/kubevirt/hotplug-disks/datavol-hotplug' index='4'/>
<backingStore/>
<target dev='sda' bus='scsi'/>
<serial>datavol-hotplug</serial>
<alias name='ua-datavol-hotplug'/>
<address type='drive' controller='0' bus='0' target='0' unit='0'/>
</disk>
</devices>
Correspondingly, the block device file of the hot-plugged volume inside the virt-launcher Pod:
$ kubectl exec -it virt-launcher-ecs-test1-n4pl8 -n demo -- ls -al /var/run/kubevirt/hotplug-disks/datavol-hotplug
brw-rw---- 1 qemu qemu 252, 240 Jul 4 07:29 /var/run/kubevirt/hotplug-disks/datavol-hotplug
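Once that device file exists inside virt-launcher's mount namespace, attaching it to the running guest is an ordinary libvirt live attach. The sketch below shows what such an attach looks like through the libvirt Go bindings (libvirt.org/go/libvirt); it only illustrates the mechanism — in KubeVirt the attach is actually driven by virt-launcher's domain manager, and the domain name format <namespace>_<vmi-name> is assumed here.

// Illustrative sketch only: live-attach a host block device to a running domain.
diskXML := `
<disk type='block' device='disk'>
  <driver name='qemu' type='raw'/>
  <source dev='/var/run/kubevirt/hotplug-disks/datavol-hotplug'/>
  <target dev='sda' bus='scsi'/>
</disk>`

conn, err := libvirt.NewConnect("qemu:///system")
if err != nil {
    panic(err)
}
defer conn.Close()

dom, err := conn.LookupDomainByName("demo_ecs-test1") // assumed <namespace>_<name> naming
if err != nil {
    panic(err)
}
defer dom.Free()

// DOMAIN_DEVICE_MODIFY_LIVE applies the change to the running guest only.
if err := dom.AttachDeviceFlags(diskXML, libvirt.DOMAIN_DEVICE_MODIFY_LIVE); err != nil {
    panic(err)
}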
Because the Pod hosting qemu-kvm must not be restarted during a disk hotplug, KubeVirt leaves the virt-launcher Pod definition untouched — the grep below returns nothing:
$ kubectl get po virt-launcher-ecs-test1-n4pl8 -n demo -o yaml | grep datavol-hotplug
The VirtualMachine and VirtualMachineInstance objects, however, are both updated: the data volume datavol-hotplug is appended to their Volumes lists:
$ kubectl get vm ecs-test1 -n demo -ojsonpath='{.spec.template.spec.volumes}' | jq
[
  {
    "name": "bootdisk",
    "persistentVolumeClaim": {
      "claimName": "ecs-test1-bootpvc-bgzrlh"
    }
  },
  {
    "cloudInitNoCloud": {
      "userData": "#cloud-config\npassword: atomic\nssh_pwauth: True\nchpasswd: { expire: False }\nbootcmd:\n- \"setenforce 0\""
    },
    "name": "cloudinitdisk"
  },
  {
    "name": "datavol-hotplug",
    "persistentVolumeClaim": {
      "claimName": "datavol-hotplug",
      "hotpluggable": true
    }
  }
]
$ kubectl get vmi ecs-test1 -n demo -ojsonpath='{.spec.volumes}' | jq
[
  {
    "name": "bootdisk",
    "persistentVolumeClaim": {
      "claimName": "ecs-test1-bootpvc-bgzrlh"
    }
  },
  {
    "cloudInitNoCloud": {
      "userData": "#cloud-config\npassword: atomic\nssh_pwauth: True\nchpasswd: { expire: False }\nbootcmd:\n- \"setenforce 0\""
    },
    "name": "cloudinitdisk"
  },
  {
    "name": "datavol-hotplug",
    "persistentVolumeClaim": {
      "claimName": "datavol-hotplug",
      "hotpluggable": true
    }
  }
]
A new Pod appears in the same namespace as the virt-launcher Pod, created by virt-controller once it notices the data volume appended to the VMI:
$ kubectl get po -n demo
NAME READY STATUS RESTARTS AGE
hp-volume-kdpgl 1/1 Running 0 17m
virt-launcher-ecs-test1-n4pl8 1/1 Running 0 20m
The hp-volume Pod is defined as follows:
$ kubectl get po hp-volume-kdpgl -n demo -o yaml
apiVersion: v1
kind: Pod
metadata:
  ownerReferences:
  - apiVersion: v1
    blockOwnerDeletion: true
    controller: true
    kind: Pod
    name: virt-launcher-ecs-test1-n4pl8
    uid: 7bf1ee01-f81f-4406-a500-b351a6115dd9
  resourceVersion: "70797056"
  uid: 90f44088-5dd0-4d4d-9ca4-d9fee4ef736a
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/hostname
            operator: In
            values:
            - bj-node171
  containers:
  - command:
    - /bin/sh
    - -c
    - /usr/bin/container-disk --copy-path /path/hp
    image: registry-1.ict-mec.net:18443/kubevirt/virt-launcher:v0.51.0-dev
    name: hotplug-disk
    volumeDevices:
    - devicePath: /path/datavol-hotplug/368836ba-6f6a-4d07-b7ad-914b856642f6
      name: datavol-hotplug
    volumeMounts:
    - mountPath: /path
      mountPropagation: HostToContainer
      name: hotplug-disks
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-drrz6
      readOnly: true
  volumes:
  - emptyDir: {}
    name: hotplug-disks
  - name: datavol-hotplug
    persistentVolumeClaim:
      claimName: datavol-hotplug
- The hp-volume Pod consumes the hotplug PVC datavol-hotplug. Once the Pod is scheduled, the PV behind that PVC is attached to the node it lands on.
- Because the hotplug PV must be attached where the VMI lives, i.e. the node running the virt-launcher Pod, the hp-volume Pod carries a node affinity that pins it to that node (see the sketch after this list).
- The hp-volume Pod's ownerReferences point to the virt-launcher Pod, so when the instance is shut down or deleted and virt-controller removes the virt-launcher Pod, this Pod is deleted along with it.
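For illustration, pinning a Pod to the virt-launcher's node comes down to a nodeAffinity term on the kubernetes.io/hostname label. A minimal sketch of that construction with the core/v1 Go types — this shows the idea, not KubeVirt's actual RenderHotplugAttachmentPodTemplate:

// k8sv1 is k8s.io/api/core/v1; virtlauncherPod is the already-running launcher Pod.
pod.Spec.Affinity = &k8sv1.Affinity{
    NodeAffinity: &k8sv1.NodeAffinity{
        RequiredDuringSchedulingIgnoredDuringExecution: &k8sv1.NodeSelector{
            NodeSelectorTerms: []k8sv1.NodeSelectorTerm{{
                MatchExpressions: []k8sv1.NodeSelectorRequirement{{
                    Key:      "kubernetes.io/hostname",
                    Operator: k8sv1.NodeSelectorOpIn,
                    Values:   []string{virtlauncherPod.Spec.NodeName},
                }},
            }},
        },
    },
}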
For an overview of the KubeVirt architecture, see https://blog.crazytaxii.com/posts/kubevirt_intro/ and https://github.com/kubevirt/kubevirt/blob/v0.51.0/docs/components.md.
virt-controller
The hp-volume Pod is created by the virt-controller after it observes the data volume being appended to the VMI:
func (c *VMIController) createAttachmentPodTemplate(vmi *virtv1.VirtualMachineInstance, virtlauncherPod *k8sv1.Pod, volumes []*virtv1.Volume) (*k8sv1.Pod, error) {
    logger := log.Log.Object(vmi)
    var pod *k8sv1.Pod
    var err error

    volumeNamesPVCMap, err := kubevirttypes.VirtVolumesToPVCMap(volumes, c.pvcInformer.GetStore(), virtlauncherPod.Namespace)
    if err != nil {
        return nil, fmt.Errorf("failed to get PVC map: %v", err)
    }
    // a lot of code here
    if len(volumeNamesPVCMap) > 0 {
        pod, err = c.templateService.RenderHotplugAttachmentPodTemplate(volumes, virtlauncherPod, vmi, volumeNamesPVCMap, false)
    }
    return pod, err
}
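VirtVolumesToPVCMap resolves each candidate volume to the PVC object behind it using the informer cache. Conceptually it boils down to something like the following sketch (an illustration of the idea, not KubeVirt's actual helper; it assumes the same virtv1/k8sv1 aliases as above plus client-go's cache package):

// Sketch: map hotplug volume names to the PVCs backing them.
func volumesToPVCMap(volumes []*virtv1.Volume, pvcStore cache.Store, namespace string) (map[string]*k8sv1.PersistentVolumeClaim, error) {
    result := map[string]*k8sv1.PersistentVolumeClaim{}
    for _, vol := range volumes {
        // Only PVC-backed volumes are of interest here (DataVolumes resolve to PVCs as well).
        if vol.PersistentVolumeClaim == nil {
            continue
        }
        obj, exists, err := pvcStore.GetByKey(namespace + "/" + vol.PersistentVolumeClaim.ClaimName)
        if err != nil {
            return nil, err
        }
        if !exists {
            return nil, fmt.Errorf("PVC %s not found", vol.PersistentVolumeClaim.ClaimName)
        }
        result[vol.Name] = obj.(*k8sv1.PersistentVolumeClaim)
    }
    return result, nil
}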
It matches the volumes against existing PVCs to find the ones that need hot-plugging (here datavol-hotplug), then renders the hp-volume Pod definition and creates it:
func (c *VMIController) createAttachmentPod(vmi *virtv1.VirtualMachineInstance, virtLauncherPod *k8sv1.Pod, volumes []*virtv1.Volume) syncError {
    attachmentPodTemplate, _ := c.createAttachmentPodTemplate(vmi, virtLauncherPod, volumes)
    if attachmentPodTemplate == nil {
        return nil
    }
    vmiKey := controller.VirtualMachineInstanceKey(vmi)
    c.podExpectations.ExpectCreations(vmiKey, 1)

    pod, err := c.clientset.CoreV1().Pods(vmi.GetNamespace()).Create(context.Background(), attachmentPodTemplate, v1.CreateOptions{})
    if err != nil {
        c.podExpectations.CreationObserved(vmiKey)
        c.recorder.Eventf(vmi, k8sv1.EventTypeWarning, FailedCreatePodReason, "Error creating attachment pod: %v", err)
        return &syncErrorImpl{fmt.Errorf("Error creating attachment pod %v", err), FailedCreatePodReason}
    }
    c.recorder.Eventf(vmi, k8sv1.EventTypeNormal, SuccessfulCreatePodReason, "Created attachment pod %s", pod.Name)
    return nil
}
Our real focus is where the block device file for the hot-plugged volume inside the virt-launcher Pod comes from — how does that device end up inside the virt-launcher Pod?
Back to the hp-volume Pod: it runs a container-disk process, an IPC service that creates a Unix socket:
$ kubectl exec -it hp-volume-kdpgl -n demo -- ls -al /path
total 0
drwxrwxrwx 3 root root 44 Jul 6 03:06 .
drwxr-xr-x 1 root root 63 Jul 6 03:06 ..
drwxr-xr-x 2 root root 50 Jul 6 03:06 datavol-hotplug
srwxr-xr-x 1 root root 0 Jul 6 03:06 hp.sock
Since the /path directory in the hp-volume Pod is an emptyDir, the socket is also reachable from the host; its path can be assembled following a fixed pattern:
$ ls -al /var/lib/kubelet/pods/$(kubectl get po hp-volume-kdpgl -n demo -o jsonpath='{.metadata.uid}')/volumes/kubernetes.io~empty-dir/hotplug-disks/hp.sock
srwxr-xr-x 1 root root 0 Jul 6 11:06 /var/lib/kubelet/pods/90f44088-5dd0-4d4d-9ca4-d9fee4ef736a/volumes/kubernetes.io~empty-dir/hotplug-disks/hp.sock
virt-handler
virt-handler is essentially another controller, but its Pod runs with high privileges (privileged). When it observes an update to a VirtualMachineInstance object, it takes care of the hot-plugged disk:
func (d *VirtualMachineController) processVmUpdate(vmi *v1.VirtualMachineInstance, domainExists bool) error {
    isUnresponsive, isInitialized, err := d.isLauncherClientUnresponsive(vmi)
    if err != nil {
        return err
    }
    if !isInitialized {
        d.Queue.AddAfter(controller.VirtualMachineInstanceKey(vmi), time.Second*1)
        return nil
    } else if isUnresponsive {
        return goerror.New(fmt.Sprintf("Can not update a VirtualMachineInstance with unresponsive command server."))
    }

    d.handlePostMigrationProxyCleanup(vmi)

    if d.isPreMigrationTarget(vmi) {
        return d.vmUpdateHelperMigrationTarget(vmi)
    } else if d.isMigrationSource(vmi) {
        return d.vmUpdateHelperMigrationSource(vmi)
    } else {
        return d.vmUpdateHelperDefault(vmi, domainExists)
    }
}
func (d *VirtualMachineController) vmUpdateHelperDefault(origVMI *v1.VirtualMachineInstance, domainExists bool) error {
    // a lot of code here

    // Try to mount hotplug volume if there is any during startup.
    if err := d.hotplugVolumeMounter.Mount(vmi); err != nil {
        return err
    }
}
func (m *volumeMounter) Mount(vmi *v1.VirtualMachineInstance) error {
    record, err := m.getMountTargetRecord(vmi)
    if err != nil {
        return err
    }
    for _, volumeStatus := range vmi.Status.VolumeStatus {
        if volumeStatus.HotplugVolume == nil {
            // Skip non hotplug volumes
            continue
        }
        sourceUID := volumeStatus.HotplugVolume.AttachPodUID
        if err := m.mountHotplugVolume(vmi, volumeStatus.Name, sourceUID, record); err != nil {
            return err
        }
    }
    return nil
}
func (m *volumeMounter) mountHotplugVolume(vmi *v1.VirtualMachineInstance, volumeName string, sourceUID types.UID, record *vmiMountTargetRecord) error {
    logger := log.DefaultLogger()
    logger.V(4).Infof("Hotplug check volume name: %s", volumeName)
    if sourceUID != types.UID("") {
        if m.isBlockVolume(&vmi.Status, volumeName) {
            logger.V(4).Infof("Mounting block volume: %s", volumeName)
            if err := m.mountBlockHotplugVolume(vmi, volumeName, sourceUID, record); err != nil {
                return err
            }
        }
    }
    return nil
}
From the VMI status it obtains the attach Pod (our hp-volume Pod) UID and calls mountHotplugVolume, which for a block volume hands off to mountBlockHotplugVolume:
func (m *volumeMounter) mountBlockHotplugVolume(vmi *v1.VirtualMachineInstance, volume string, sourceUID types.UID, record *vmiMountTargetRecord) error {
    virtlauncherUID := m.findVirtlauncherUID(vmi)
    if virtlauncherUID == "" {
        // This is not the node the pod is running on.
        return nil
    }
    targetPath, err := m.hotplugDiskManager.GetHotplugTargetPodPathOnHost(virtlauncherUID)
    if err != nil {
        return err
    }
    // a lot of code here
    if isBlockExists, _ := isBlockDevice(deviceName); !isBlockExists {
        sourceMajor, sourceMinor, permissions, err := m.getSourceMajorMinor(sourceUID, volume)
        if err != nil {
            return err
        }
        if err := m.writePathToMountRecord(deviceName, vmi, record); err != nil {
            return err
        }
        // allow block devices
        if err := m.allowBlockMajorMinor(sourceMajor, sourceMinor, cgroupsManager); err != nil {
            return err
        }
        if _, err = m.createBlockDeviceFile(deviceName, sourceMajor, sourceMinor, permissions); err != nil {
            return err
        }
    }
}
- GetHotplugTargetPodPathOnHost assembles the host-side path at which the block device file needed by the hot-plugged data volume of the virt-launcher Pod will be created.
- getSourceMajorMinor reads the major and minor numbers of the block device file (datavol-hotplug) inside the hp-volume Pod.
- createBlockDeviceFile creates the block device file /path/to/datavol-hotplug inside the virt-launcher mount namespace.
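One step the list above skips is allowBlockMajorMinor: before qemu can open the new device node, the major:minor pair has to be whitelisted in the virt-launcher container's device cgroup. A minimal sketch of that idea, assuming a cgroup-v1 devices controller and a hypothetical cgroupPath (KubeVirt resolves the real path, and handles cgroup v2, through its own cgroup manager):

// Sketch (cgroup v1): allow a block device major:minor pair for a container.
// Uses fmt, os and path/filepath from the standard library.
func allowBlockDevice(cgroupPath string, major, minor int64) error {
    rule := fmt.Sprintf("b %d:%d rwm", major, minor)
    return os.WriteFile(filepath.Join(cgroupPath, "devices.allow"), []byte(rule), 0200)
}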
Let's take a closer look at the createBlockDeviceFile method:
var mknodCommand = func(deviceName string, major, minor int64, blockDevicePermissions string) ([]byte, error) {
    output, err := exec.Command("/usr/bin/mknod", "--mode", fmt.Sprintf("0%s", blockDevicePermissions), deviceName, "b", strconv.FormatInt(major, 10), strconv.FormatInt(minor, 10)).CombinedOutput()
    log.Log.V(3).Infof("running mknod. err: %v, output: %s", err, string(output))
    return output, err
}
func (m *volumeMounter) createBlockDeviceFile(deviceName string, major, minor int64, blockDevicePermissions string) (string, error) {
    exists, err := diskutils.FileExists(deviceName)
    if err != nil {
        return "", err
    }
    if !exists {
        out, err := mknodCommand(deviceName, major, minor, blockDevicePermissions)
        if err != nil {
            log.DefaultLogger().Errorf("Error creating block device file: %s, %v", out, err)
            return "", err
        }
    }
    return deviceName, nil
}
It invokes mknod with the major and minor numbers as arguments to create a block device file; that device is in fact the rbd disk that was attached to the host along with the hp-volume Pod.
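Conceptually, getSourceMajorMinor plus createBlockDeviceFile amount to a stat(2) on the source device node followed by a mknod(2) at the target path. Below is a self-contained sketch using golang.org/x/sys/unix; KubeVirt itself shells out to /usr/bin/mknod as shown above, and the paths here are only examples:

package main

import (
    "fmt"

    "golang.org/x/sys/unix"
)

// cloneBlockDevice reads the major/minor numbers of an existing block device
// node and creates a new node that points at the same kernel device.
func cloneBlockDevice(source, target string) error {
    var st unix.Stat_t
    if err := unix.Stat(source, &st); err != nil {
        return fmt.Errorf("stat %s: %w", source, err)
    }
    major := unix.Major(uint64(st.Rdev))
    minor := unix.Minor(uint64(st.Rdev))
    // S_IFBLK marks the node as a block device; 0660 mirrors the qemu:qemu permissions seen earlier.
    return unix.Mknod(target, unix.S_IFBLK|0660, int(unix.Mkdev(major, minor)))
}

func main() {
    // Example paths; both must be visible in the caller's mount namespace.
    if err := cloneBlockDevice("/dev/rbd32", "/tmp/datavol-hotplug"); err != nil {
        fmt.Println(err)
    }
}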
For how Kubernetes block-mode PVs work, see https://blog.crazytaxii.com/posts/k8s_pv_block_mode/.
So the rbd device on the host (/dev/rbd32 here), the block device file inside the hp-volume Pod, and the block device file inside virt-launcher all point to the same rbd volume. In theory, a block-mode hot-plugged data volume in KubeVirt therefore performs the same as a data volume declared when the instance was created.
Unix socket
Finally, let's explore what the Unix socket created by the container-disk app in the hp-volume Pod is actually for:
$ kubectl exec -it hp-volume-kdpgl -n demo -- ls -al /path
total 0
drwxr-xr-x 2 root root 50 Jul 6 03:06 datavol-hotplug
srwxr-xr-x 1 root root 0 Jul 6 03:06 hp.sock
Within container-disk's own logic, nothing is ever served over this socket — it is not meant for real communication. Its purpose becomes clear on the virt-handler side:
var socketPath = func(podUID types.UID) string {
    return fmt.Sprintf("pods/%s/volumes/kubernetes.io~empty-dir/hotplug-disks/hp.sock", string(podUID))
}
func (m *volumeMounter) getSourcePodFilePath(sourceUID types.UID, vmi *v1.VirtualMachineInstance, volume string) (string, error) {
    iso := isolationDetector("/path")
    isoRes, err := iso.DetectForSocket(vmi, socketPath(sourceUID))
    if err != nil {
        return "", err
    }
    findmounts, err := LookupFindmntInfoByVolume(volume, isoRes.Pid())
    if err != nil {
        return "", err
    }
    // a lot of code here
    // Did not find the disk image file, return error
    return "", fmt.Errorf("unable to find source disk image path for pod %s", sourceUID)
}
Jumping into the DetectForSocket method:
func (s *socketBasedIsolationDetector) DetectForSocket(vm *v1.VirtualMachineInstance, socket string) (IsolationResult, error) {
    var pid int
    var ppid int
    var err error

    if pid, err = s.getPid(socket); err != nil {
        log.Log.Object(vm).Reason(err).Errorf("Could not get owner Pid of socket %s", socket)
        return nil, err
    }
    // a lot of code here
}
It obtains the PID of the socket's owning process (i.e. container-disk) through the Unix socket itself:
func (s *socketBasedIsolationDetector) getPid(socket string) (int, error) {
    sock, err := net.DialTimeout("unix", socket, time.Duration(isolationDialTimeout)*time.Second)
    if err != nil {
        return -1, err
    }
    defer sock.Close()

    ufile, err := sock.(*net.UnixConn).File()
    if err != nil {
        return -1, err
    }
    // This is the tricky part, which will give us the PID of the owning socket
    ucreds, err := syscall.GetsockoptUcred(int(ufile.Fd()), syscall.SOL_SOCKET, syscall.SO_PEERCRED)
    if err != nil {
        return -1, err
    }
    if int(ucreds.Pid) == 0 {
        return -1, fmt.Errorf("the detected PID is 0. Is the isolation detector running in the host PID namespace?")
    }
    return int(ucreds.Pid), nil
}
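With that PID in hand, a privileged process sharing the host PID namespace can reach into the hp-volume Pod's mount namespace without any extra mounts, because procfs exposes each process's root filesystem under /proc/<pid>/root. A tiny sketch of that idea (the helper name is made up; KubeVirt's actual path handling lives in its isolation package):

// inPodPath addresses a file inside another mount namespace via procfs.
// Uses path/filepath and strconv from the standard library.
func inPodPath(pid int, path string) string {
    return filepath.Join("/proc", strconv.Itoa(pid), "root", path)
}

// e.g. inPodPath(pid, "/path/datavol-hotplug") resolves the hotplug volume
// directory inside the hp-volume Pod as seen from the host.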
Summary
KubeVirt uses an intermediate Pod (hp-volume) to make CSI attach the PV to the host running the virt-launcher Pod, and the privileged virt-handler then creates a block device file pointing at that rbd device inside the mount namespace of the qemu-kvm Pod. Without restarting the virt-launcher Pod, the hot-plugged data volume keeps essentially the same performance as any other data volume — a rather elegant design.