How KubeVirt KVM Real-Time Works

Apr 18, 2024 19:30 · KubeVirt Virtualization Linux

KubeVirt v1.0.0 supports creating latency-sensitive real-time virtual machines, but because libvirtd (virtqemud) runs inside a container (Pod), KubeVirt has to resort to a few "extra tricks" to make this work.

Usage

The KubeVirt VirtualMachine CRD configures real-time VMs through the following fields:

  1. spec.domain.cpu.realtime: KubeVirt configures the Linux scheduler to run the vCPU threads with the SCHED_FIFO policy at priority 1, so that every process inside the guest effectively runs at real-time priority.
  2. spec.domain.cpu.realtime.mask: defines which of the VM's vCPUs are real-time. If unset, all vCPUs run with SCHED_FIFO (a real-time scheduling class) at priority 1. Note that 1 is the lowest SCHED_FIFO priority (the range is 1-99, and for real-time policies a higher number means higher priority), yet any real-time task still preempts all normal SCHED_OTHER tasks.

To summarize the SCHED_FIFO policy: tasks of equal priority run first come, first served; a higher-priority task can preempt the CPU from a lower-priority one.
See https://kubevirt.io/user-guide/virtual_machines/numa/#running-real-time-workloads
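
The mask appears to follow a libvirt-style cpuset grammar; the KubeVirt user guide shows values such as "0-3,^1", meaning vCPUs 0, 2 and 3 with vCPU 1 excluded. Below is a simplified Go sketch of how such a mask could be expanded. This is a hypothetical parser for illustration only, not KubeVirt's actual parseCPUMask:

package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseMask expands a libvirt-style cpuset string such as "0-3,^1" into the
// set of enabled vCPU IDs. Simplified for illustration; KubeVirt's own
// parseCPUMask also validates the mask against the VM's vCPU count.
func parseMask(mask string) (map[int]bool, error) {
	enabled := map[int]bool{}
	for _, part := range strings.Split(mask, ",") {
		part = strings.TrimSpace(part)
		negate := strings.HasPrefix(part, "^")
		part = strings.TrimPrefix(part, "^")

		lo, hi := part, part
		if i := strings.Index(part, "-"); i >= 0 {
			lo, hi = part[:i], part[i+1:]
		}
		start, err := strconv.Atoi(lo)
		if err != nil {
			return nil, err
		}
		end, err := strconv.Atoi(hi)
		if err != nil {
			return nil, err
		}
		for cpu := start; cpu <= end; cpu++ {
			if negate {
				delete(enabled, cpu) // "^N" removes a vCPU enabled earlier
			} else {
				enabled[cpu] = true
			}
		}
	}
	return enabled, nil
}

func main() {
	set, err := parseMask("0-3,^1")
	if err != nil {
		panic(err)
	}
	fmt.Println(set) // map[0:true 2:true 3:true]
}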

A real-time VirtualMachine definition:

apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  # a lot of metadata
spec:
  running: true
  template:
    metadata:
      labels:
        kubevirt.io/vm: ecs-realtime
    spec:
      architecture: amd64
      domain:
        cpu:
          cores: 1
          dedicatedCpuPlacement: true
          model: host-passthrough
          numa:
            guestMappingPassthrough: {}
          realtime: {}
          sockets: 2
          threads: 1
        devices:
          disks:
          - bootOrder: 1
            disk:
              bus: virtio
            name: bootdisk
          - disk:
              bus: virtio
            name: cloudinitdisk
          interfaces:
          - bridge: {}
            name: attachnet1
        machine:
          type: q35
        memory:
          guest: 2Gi
          hugepages:
            pageSize: 2Mi
        resources:
          limits:
            cpu: "2"
            memory: 2Gi
          requests:
            cpu: "2"
            memory: 2Gi
      hostname: ecs-realtime
      networks:
      - multus:
          networkName: mec-nets/attachnet1
        name: attachnet1
      volumes:
      - name: bootdisk
        persistentVolumeClaim:
          claimName: ecs-realtime-bootpvc-hlie0p
      - cloudInitConfigDrive:
          userData: |-
            #cloud-config
            user: root
            password: atomic
            ssh_pwauth: True
            chpasswd: { expire: False }            
        name: cloudinitdisk

Prerequisites

  1. Kubernetes (kubelet) must enable the CPUManager so KubeVirt VMs can get dedicated (pinned) CPUs: technically, real-time and CPU pinning are separate concerns and not coupled, but real-time works much better on pinned CPUs, so at the product level KubeVirt requires CPU pinning whenever a VM enables real-time.

    See Kubernetes CPUManager: https://kubernetes.io/docs/tasks/administer-cluster/cpu-management-policies/

  2. The VM definition must enable NUMA topology passthrough (GuestMappingPassthrough): as above, this is a restriction KubeVirt imposes at the product level.

    See KubeVirt NUMA: https://kubevirt.io/user-guide/virtual_machines/numa/

  3. The CPU model must be host-passthrough, so the guest sees the host CPU directly (passed through) with no capabilities masked.

    This setting gives the best performance, but it severely limits live-migration compatibility: the VM can only be migrated to nodes whose CPUs are identical to the source host's.

    See libvirt domain host-passthrough: https://libvirt.org/formatdomain.html

  4. The VM definition must configure hugepages: this in turn is a prerequisite for the NUMA topology passthrough.

    See KubeVirt hugepages: https://kubevirt.io/user-guide/virtual_machines/virtual_hardware/#hugepages

  5. The node must allow processes (threads) with the SCHED_FIFO scheduling policy to run unthrottled.

    Check and set the node's kernel.sched_rt_runtime_us kernel parameter:

    $ sysctl kernel.sched_rt_runtime_us
    kernel.sched_rt_runtime_us = 950000 # default value
    $ sysctl -w kernel.sched_rt_runtime_us=-1

    See Linux real-time group scheduling: https://www.kernel.org/doc/Documentation/scheduler/sched-rt-group.txt
    Note: the default 950000 (0.95 s) reserves 0.05 s of every second for non-real-time (SCHED_OTHER) tasks, so that a runaway real-time task cannot lock up the machine and there is always a little time left to recover. Setting it to -1 removes this safety net and can break the system!

virt-handler checks the kernel.sched_rt_runtime_us kernel parameter on its node and adds the kubevirt.io/realtime label to every node whose value matches the expectation (-1):

https://github.com/kubevirt/kubevirt/blob/04a198e5a33cd1369e534f55b26920dce7776f69/pkg/virt-handler/node-labeller/node_labeller.go#L318-L324

func (n *NodeLabeller) prepareLabels(node *v1.Node, cpuModels []string, cpuFeatures cpuFeatures, hostCpuModel hostCPUModel, obsoleteCPUsx86 map[string]bool) map[string]string {
    // a lot of code here
    capable, err := isNodeRealtimeCapable()
    if err != nil {
        n.logger.Reason(err).Error("failed to identify if a node is capable of running realtime workloads")
    }
    if capable {
        newLabels[kubevirtv1.RealtimeLabel] = ""
    }
    // a lot of code here
}

// https://github.com/kubevirt/kubevirt/blob/04a198e5a33cd1369e534f55b26920dce7776f69/pkg/virt-handler/node-labeller/node_labeller.go#L367-L381

const kernelSchedRealtimeRuntimeInMicrosecods = "kernel.sched_rt_runtime_us"

func isNodeRealtimeCapable() (bool, error) {
    ret, err := exec.Command("sysctl", kernelSchedRealtimeRuntimeInMicrosecods).CombinedOutput()
    if err != nil {
        return false, err
    }
    st := strings.Trim(string(ret), "\n")
    return fmt.Sprintf("%s = -1", kernelSchedRealtimeRuntimeInMicrosecods) == st, nil
}

Note that virt-handler inspects the node's kernel parameter simply by running the sysctl kernel.sched_rt_runtime_us command.
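
Shelling out works, but the same value is also exposed directly in procfs as /proc/sys/kernel/sched_rt_runtime_us. A minimal Go sketch of an equivalent check (an illustration, not KubeVirt's code):

package main

import (
	"fmt"
	"os"
	"strings"
)

// isNodeRealtimeCapable reports whether the node imposes no runtime limit on
// real-time tasks, i.e. kernel.sched_rt_runtime_us == -1.
func isNodeRealtimeCapable() (bool, error) {
	raw, err := os.ReadFile("/proc/sys/kernel/sched_rt_runtime_us")
	if err != nil {
		return false, err
	}
	return strings.TrimSpace(string(raw)) == "-1", nil
}

func main() {
	capable, err := isNodeRealtimeCapable()
	if err != nil {
		panic(err)
	}
	fmt.Println("realtime capable:", capable)
}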

For a real-time VM, KubeVirt's virt-api mutating webhook adds the kubevirt.io/realtime node selector to the VMI, and virt-controller carries it over when generating the virt-launcher Pod, ensuring the Pod is scheduled onto a node that carries the label:

https://github.com/kubevirt/kubevirt/blob/04a198e5a33cd1369e534f55b26920dce7776f69/pkg/virt-api/webhooks/mutating-webhook/mutators/vmi-mutator.go#L87-L90

    if newVMI.IsRealtimeEnabled() {
        log.Log.V(4).Info("Add realtime node label selector")
        addNodeSelector(newVMI, v1.RealtimeLabel)
    }

$ kubectl get po virt-launcher-ecs-realtime-wt2k8 -o jsonpath='{.spec.nodeSelector}' | jq
{
  "cpumanager": "true",
  "kubernetes.io/arch": "amd64",
  "kubevirt.io/realtime": "",
  "kubevirt.io/schedulable": "true"
}

$ kubectl get nodes -l "kubevirt.io/realtime"
NAME      STATUS   ROLES                  AGE    VERSION
node164   Ready    control-plane,worker   226d   v1.27.2

How It Works

According to the libvirt documentation, real-time is configured by defining CPU tuning fields on the VM (libvirt domain), for example:

<cputune>
  <emulatorpin cpuset="8-9"/>
  <vcpupin vcpu="0" cpuset="12"/>
  <vcpupin vcpu="1" cpuset="13"/>
  <vcpupin vcpu="2" cpuset="14"/>
  <vcpupin vcpu="3" cpuset="15"/>
  <vcpusched vcpus='0-3' scheduler='fifo' priority='1'/>
</cputune>
  • emulatorpin pins the emulator's own threads (here, qemu-kvm itself) to logical CPUs
  • vcpupin pins the VM's vCPUs, i.e. the emulator's compute threads, to logical CPUs
  • vcpusched sets the scheduling policy and priority of the VM's vCPU (compute) threads

See libvirt CPU tuning: https://libvirt.org/formatdomain.html#cpu-tuning

Look at the VM process (qemu-kvm) inside the virt-launcher Pod backing the KubeVirt VM:

$ kubectl exec -it virt-launcher-ecs-realtime-wt2k8 -- ps -ef
UID         PID   PPID  C STIME TTY          TIME CMD
qemu          1      0  0 09:09 ?        00:00:00 /usr/bin/virt-launcher-monitor
qemu         12      1  0 09:09 ?        00:00:07 /usr/bin/virt-launcher --qemu-
qemu         19     12  0 09:09 ?        00:00:08 /usr/sbin/virtqemud -f /var/ru
qemu         30     12  0 09:09 ?        00:00:01 /usr/sbin/virtlogd -f /etc/lib
qemu         71      1  0 09:09 ?        00:00:54 /usr/libexec/qemu-kvm -name gu

$ kubectl exec -it virt-launcher-ecs-realtime-wt2k8 -- ps -p 71 -L
PID    LWP TTY          TIME CMD
    71     71 ?        00:00:05 qemu-kvm
    71     73 ?        00:00:00 qemu-kvm
    71     74 ?        00:00:00 TC tc-ram-node0
    71     75 ?        00:00:00 IO iothread1
    71     78 ?        00:00:05 IO mon_iothread
    71     79 ?        00:00:30 CPU 0/KVM
    71     80 ?        00:00:12 CPU 1/KVM
    71     82 ?        00:00:00 vnc_worker

All threads other than the compute (vCPU) and IO threads count as the emulator's own.

And its libvirt domain definition:

$ kubectl exec -it virt-launcher-ecs-realtime-wt2k8 -- virsh dumpxml 1
<domain type='kvm' id='1'>
  <cputune>
    <vcpupin vcpu='0' cpuset='4'/>
    <vcpupin vcpu='1' cpuset='16'/>
  </cputune>
</domain>

But unlike the CPU tuning example above, no vcpusched is defined. The reason: virtqemud running inside the virt-launcher Pod (KubeVirt replaced libvirtd with virtqemud starting with v0.59.0) lacks the privileges to adjust the scheduling policy and priority of the qemu-kvm compute threads as the domain XML would dictate; raising a thread into a real-time class requires CAP_SYS_NICE (or a nonzero RLIMIT_RTPRIO), which the Pod does not grant.
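
To see concretely what the unprivileged virtqemud is up against, here is a minimal standalone Go sketch (not KubeVirt code) that tries to move the calling thread to SCHED_FIFO and reports EPERM when the capability is missing:

package main

import (
	"fmt"
	"runtime"
	"unsafe"

	"golang.org/x/sys/unix"
)

// schedParam mirrors the C struct sched_param: a single int holding the
// real-time priority.
type schedParam struct {
	priority int32
}

const schedFIFO = 1 // SCHED_FIFO policy ID from <linux/sched.h>

func main() {
	// Pin the goroutine so "the calling thread" (pid 0 below) is well defined.
	runtime.LockOSThread()

	param := schedParam{priority: 1}
	_, _, errno := unix.Syscall(unix.SYS_SCHED_SETSCHEDULER, 0, schedFIFO, uintptr(unsafe.Pointer(&param)))
	switch errno {
	case 0:
		fmt.Println("now running with SCHED_FIFO priority 1")
	case unix.EPERM:
		fmt.Println("EPERM: need CAP_SYS_NICE (or RLIMIT_RTPRIO > 0)")
	default:
		panic(errno)
	}
}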

Compare the scheduling policy and priority of the compute threads of the real-time qemu-kvm (ecs-realtime) against those of a regular qemu-kvm:

# Real-Time VM
$ kubectl exec -it virt-launcher-ecs-realtime-wt2k8 -- chrt -p 79
pid 79's current scheduling policy: SCHED_FIFO
pid 79's current scheduling priority: 1
$ kubectl exec -it virt-launcher-ecs-realtime-wt2k8 -- chrt -p 80
pid 80's current scheduling policy: SCHED_FIFO
pid 80's current scheduling priority: 1

# regular VM
$ kubectl exec -it virt-launcher-ecs-test10-b5rtv -- chrt -p 76
pid 76's current scheduling policy: SCHED_OTHER
pid 76's current scheduling priority: 0
$ kubectl exec -it virt-launcher-ecs-test10-b5rtv -- chrt -p 77
pid 77's current scheduling policy: SCHED_OTHER
pid 77's current scheduling priority: 0

This confirms that a KubeVirt VM defined with realtime: {} really does run its compute threads with the SCHED_FIFO policy at priority 1, while a regular VM's compute threads use SCHED_OTHER.
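
As an aside, what chrt -p prints can be reproduced with two raw system calls, sched_getscheduler(2) and sched_getparam(2). A minimal Go sketch (pass a thread ID as the argument; Linux only):

package main

import (
	"fmt"
	"os"
	"strconv"
	"unsafe"

	"golang.org/x/sys/unix"
)

// schedParam mirrors the C struct sched_param.
type schedParam struct {
	priority int32
}

func main() {
	// Usage: go run chrt.go <tid>
	tid, err := strconv.Atoi(os.Args[1])
	if err != nil {
		panic(err)
	}

	// sched_getscheduler(2) returns the policy: 0 = SCHED_OTHER, 1 = SCHED_FIFO, 2 = SCHED_RR, ...
	policy, _, errno := unix.Syscall(unix.SYS_SCHED_GETSCHEDULER, uintptr(tid), 0, 0)
	if errno != 0 {
		panic(errno)
	}

	// sched_getparam(2) fills in the real-time priority (0 for non-real-time policies).
	var param schedParam
	if _, _, errno := unix.Syscall(unix.SYS_SCHED_GETPARAM, uintptr(tid), uintptr(unsafe.Pointer(&param)), 0); errno != 0 {
		panic(errno)
	}

	fmt.Printf("policy=%d priority=%d\n", policy, param.priority)
}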

To state the conclusion up front: KubeVirt uses virt-handler, standing in for libvirtd, to adjust the compute threads' scheduling policy and priority in the Linux scheduler. virt-handler runs privileged:

$ kubectl get ds virt-handler -n kubevirt -o jsonpath='{.spec.template.spec.containers[0].securityContext}' | jq
{
  "privileged": true,
  "seLinuxOptions": {
    "level": "s0"
  }
}

See Kubernetes SecurityContext: https://kubernetes.io/docs/tasks/configure-pod-container/security-context/

Next, let's look at how virt-handler does it: https://github.com/kubevirt/kubevirt/blob/v1.0.0/pkg/virt-handler/realtime.go#L37-L70

// configureRealTimeVCPUs parses the realtime mask value and configured the selected vcpus
// for real time workloads by setting the scheduler to FIFO and process priority equal to 1.
func (d *VirtualMachineController) configureVCPUScheduler(vmi *v1.VirtualMachineInstance) error {
    res, err := d.podIsolationDetector.Detect(vmi)
    if err != nil {
        return err
    }
    qemuProcess, err := res.GetQEMUProcess()
    if err != nil {
        return err
    }
    vcpus, err := getVCPUThreadIDs(qemuProcess.Pid())
    if err != nil {
        return err
    }
    mask, err := parseCPUMask(vmi.Spec.Domain.CPU.Realtime.Mask)
    if err != nil {
        return err
    }
    for vcpuID, threadID := range vcpus {
        if mask.isEnabled(vcpuID) {
            param := schedParam{priority: 1}
            tid, err := strconv.Atoi(threadID)
            if err != nil {
                return err
            }
            err = schedSetScheduler(tid, schedFIFO, param)
            if err != nil {
                return fmt.Errorf("failed to set FIFO scheduling and priority 1 for thread %d: %w", tid, err)
            }
        }
    }
    return nil
}
  1. virt-handler first obtains the PID of the qemu-kvm process behind the VMI

  2. It calls getVCPUThreadIDs, which reads /proc/${qemu-pid}/task to find qemu-kvm's compute threads

    https://github.com/kubevirt/kubevirt/blob/v1.0.0/pkg/virt-handler/realtime.go#L79-L99

    // vcpuRegex is declared near the top of realtime.go and matches the
    // comm value of a qemu-kvm compute thread, e.g. "CPU 0/KVM".
    var vcpuRegex = regexp.MustCompile(`^CPU (\d+)/KVM\n$`)
    
    func isVCPU(comm []byte) (string, bool) {
        if !vcpuRegex.MatchString(string(comm)) {
            return "", false
        }
        v := vcpuRegex.FindSubmatch(comm)
        return string(v[1]), true
    }
    
    func getVCPUThreadIDs(pid int) (map[string]string, error) {
        p := filepath.Join(string(os.PathSeparator), "proc", strconv.Itoa(pid), "task")
        d, err := os.ReadDir(p)
        if err != nil {
            return nil, err
        }
        ret := map[string]string{}
        for _, f := range d {
            if f.IsDir() {
                c, err := os.ReadFile(filepath.Join(p, f.Name(), "comm"))
                if err != nil {
                    return nil, err
                }
                if v, ok := isVCPU(c); ok {
                    ret[v] = f.Name()
                }
            }
        }
        return ret, nil
    }
    

    A thread whose comm file matches the regular expression ^CPU (\d+)/KVM\n$ is a compute thread; this mirrors what we saw earlier with ps -L -p.

    $ ll /proc/592916/task
    total 0
    dr-xr-xr-x 7 qemu qemu 0 Apr 18 05:14 592916
    dr-xr-xr-x 7 qemu qemu 0 Apr 18 05:14 592920
    dr-xr-xr-x 7 qemu qemu 0 Apr 18 05:14 592925
    dr-xr-xr-x 7 qemu qemu 0 Apr 18 05:14 592926
    dr-xr-xr-x 7 qemu qemu 0 Apr 18 05:14 592953
    dr-xr-xr-x 7 qemu qemu 0 Apr 18 05:14 592954
    dr-xr-xr-x 7 qemu qemu 0 Apr 18 05:14 592955
    dr-xr-xr-x 7 qemu qemu 0 Apr 18 05:14 592958
    
    $ cat /proc/592916/task/592954/comm
    CPU 0/KVM
    $ cat /proc/592916/task/592955/comm
    CPU 1/KVM
    
    $ ps -L -p 592916
    PID    LWP TTY          TIME CMD
    592916 592916 ?        00:00:30 qemu-kvm
    592916 592920 ?        00:00:00 qemu-kvm
    592916 592925 ?        00:00:00 TC tc-ram-node0
    592916 592926 ?        00:00:00 IO iothread1
    592916 592953 ?        00:00:45 IO mon_iothread
    592916 592954 ?        00:00:45 CPU 0/KVM
    592916 592955 ?        00:00:37 CPU 1/KVM
    592916 592958 ?        00:00:00 vnc_worker
    

    /proc API reference: https://man7.org/linux/man-pages/man5/proc.5.html
    /proc/pid/comm (since Linux 2.6.33)
    This file exposes the process's comm value, that is, the command name associated with the process. Different threads in the same process may have different comm values, accessible via /proc/pid/task/tid/comm. A thread may modify its comm value, or that of any of the other threads in the same thread group (see the discussion of CLONE_THREAD in clone(2)), by writing to the file /proc/self/task/tid/comm.

  3. It calls schedSetScheduler to set the compute threads' scheduling policy to SCHED_FIFO and their scheduling priority to 1

    https://github.com/kubevirt/kubevirt/blob/v1.0.0/pkg/virt-handler/setsched.go#L24-L39

    const (
        // schedFIFO represents the Linux SCHED_FIFO scheduling policy ID:
        //
        // #define SCHED_FIFO  1
        //
        // Ref: https://github.com/torvalds/linux/blob/c2bf05db6c78f53ca5cd4b48f3b9b71f78d215f1/include/uapi/linux/sched.h#L115
        schedFIFO policy = 1
    )
    
    func schedSetScheduler(pid int, policy policy, param schedParam) error {
        _, _, e1 := unix.Syscall(unix.SYS_SCHED_SETSCHEDULER, uintptr(pid), uintptr(policy), uintptr(unsafe.Pointer(&param)))
        if e1 != 0 {
            return e1
        }
        return nil
    }
    

    This is implemented with the sched_setscheduler system call; Go does not provide a wrapper for it, so the generic Syscall function from the golang.org/x/sys/unix package is used to invoke it directly. The schedParam struct mirrors the C struct sched_param, which contains a single int, the real-time priority.

    See sched_setscheduler(2): https://man7.org/linux/man-pages/man2/sched_setscheduler.2.html

virt-handler additionally sets the scheduling policy and priority of the KVM PIT thread: https://github.com/kubevirt/kubevirt/blob/a8b752c2f2a3152f69b1faf2bb6af258fae7337c/pkg/virt-handler/vm.go#L2713-L2763

KVM PIT emulates the i8254 PIT (Programmable Interval Timer) device for the VM, providing the guest with a timer.
See the KVM API documentation: https://www.kernel.org/doc/Documentation/virtual/kvm/api.txt
PIT timer interrupts may use a per-VM kernel thread for injection. If it exists, this thread will have a name of the following pattern:
kvm-pit/<owner-process-pid>
When running a guest with elevated priorities, the scheduling parameters of this thread may have to be adjusted accordingly.

func (d *VirtualMachineController) affinePitThread(vmi *v1.VirtualMachineInstance) error {
    res, err := d.podIsolationDetector.Detect(vmi)
    if err != nil {
        return err
    }
    // a lot of code here
    pitpid, err := res.KvmPitPid()
    if err != nil {
        return err
    }
    if pitpid == -1 {
        return nil
    }
    if vmi.IsRealtimeEnabled() {
        param := schedParam{priority: 2}
        err = schedSetScheduler(pitpid, schedFIFO, param)
        if err != nil {
            return fmt.Errorf("failed to set FIFO scheduling and priority 2 for thread %d: %w", pitpid, err)
        }
    }
    // a lot of code here
    return nil
}

As with the compute threads, the sched_setscheduler system call sets this thread's policy to SCHED_FIFO, but its priority is 2. virt-handler calls the KvmPitPid method to find the KVM PIT kernel thread that belongs to the qemu-kvm process:

https://github.com/kubevirt/kubevirt/blob/v1.0.0/pkg/virt-handler/isolation/isolation.go#L198-L216

func (r *RealIsolationResult) KvmPitPid() (int, error) {
    qemuprocess, err := r.GetQEMUProcess()
    if err != nil {
        return -1, err
    }
    processes, _ := ps.Processes()
    nspid, err := GetNspid(qemuprocess.Pid())
    if err != nil || nspid == -1 {
        return -1, err
    }
    pitstr := "kvm-pit/" + strconv.Itoa(nspid)

    for _, process := range processes {
        if process.Executable() == pitstr {
            return process.Pid(), nil
        }
    }
    return -1, nil
}

This part is fairly brute-force: it iterates over all visible processes and matches each executable name against kvm-pit/${qemu-kvm-nspid} to locate the corresponding KVM PIT thread. Let's replay it by hand:

$ kubectl exec -it virt-launcher-ecs-realtime-wt2k8 -- ps -ef | grep qemu-kvm
qemu         71      1  0 Apr17 ?        00:03:09 /usr/libexec/qemu-kvm -name gu

$ ps -ef | grep "kvm-pit/71"
root     592956      2  0 Apr17 ?        00:00:00 [kvm-pit/71]

$ chrt -p 592956
pid 592956's current scheduling policy: SCHED_FIFO
pid 592956's current scheduling priority: 2
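
The 71 in kvm-pit/71 is the qemu-kvm process's PID as seen inside the Pod's PID namespace, which is what GetNspid resolves. The kernel exposes this mapping in the NSpid: line of /proc/<pid>/status (host PID first, then one entry per nested PID namespace). A minimal sketch of such a lookup, with a hypothetical getNspid helper rather than KubeVirt's exact code:

package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// getNspid parses the "NSpid:" line of /proc/<pid>/status and returns the
// innermost namespace PID, e.g. "NSpid: 592916 71" yields 71.
func getNspid(pid int) (int, error) {
	f, err := os.Open("/proc/" + strconv.Itoa(pid) + "/status")
	if err != nil {
		return -1, err
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		line := scanner.Text()
		if strings.HasPrefix(line, "NSpid:") {
			fields := strings.Fields(line)
			return strconv.Atoi(fields[len(fields)-1])
		}
	}
	return -1, scanner.Err()
}

func main() {
	// For a process not in a nested PID namespace this just prints its own PID.
	nspid, err := getNspid(os.Getpid())
	if err != nil {
		panic(err)
	}
	fmt.Println("nspid:", nspid)
}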

Because the virtqemud process in the container (running as the qemu user) cannot change the scheduling policy and priority of the qemu-kvm compute threads and the KVM PIT thread from inside the virt-launcher Pod, KubeVirt takes a detour: the privileged virt-handler daemon does what virtqemud would otherwise do.

One final note: real-time VMs need a real-time host kernel. The default kernel of most Linux distributions is not suited to low-latency real-time workloads and has to be replaced, or you can pick a specialized Linux distribution that ships one.