How KubeVirt KVM Real-Time Works
Apr 18, 2024 19:30 · 3359 words · 7 minute read
KubeVirt v1.0.0 supports creating latency-sensitive real-time virtual machines, but because libvirtd (virtqemud) runs inside a container (Pod), KubeVirt has to resort to some "extra tricks" to make this work.
Usage
The KubeVirt VirtualMachine CRD configures real-time VMs through the following fields:

- spec.domain.cpu.realtime: KubeVirt configures the Linux scheduler to run the vCPU threads with the SCHED_FIFO policy at priority 1, so that the guest's processes effectively execute at real-time priority.
- spec.domain.cpu.realtime.mask: defines which of the VM's vCPUs are real-time (per the KubeVirt docs the mask uses cpuset-style notation, e.g. "0-3,^1" selects vCPUs 0, 2 and 3). If unset, all vCPUs run with SCHED_FIFO (a real-time scheduling class) at priority 1. Note that for SCHED_FIFO valid priorities range from 1 to 99 and a larger number means higher priority, so 1 is the lowest real-time priority; it still preempts every normal SCHED_OTHER task.

To summarize the SCHED_FIFO policy: among tasks of equal priority it is first-come, first-served, and a higher-priority task can preempt a lower-priority one on the CPU. See https://kubevirt.io/user-guide/virtual_machines/numa/#running-real-time-workloads
A real-time VirtualMachine definition:
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  # a lot of metadata
spec:
  running: true
  template:
    metadata:
      labels:
        kubevirt.io/vm: ecs-realtime
    spec:
      architecture: amd64
      domain:
        cpu:
          cores: 1
          dedicatedCpuPlacement: true
          model: host-passthrough
          numa:
            guestMappingPassthrough: {}
          realtime: {}
          sockets: 2
          threads: 1
        devices:
          disks:
          - bootOrder: 1
            disk:
              bus: virtio
            name: bootdisk
          - disk:
              bus: virtio
            name: cloudinitdisk
          interfaces:
          - bridge: {}
            name: attachnet1
        machine:
          type: q35
        memory:
          guest: 2Gi
          hugepages:
            pageSize: 2Mi
        resources:
          limits:
            cpu: "2"
            memory: 2Gi
          requests:
            cpu: "2"
            memory: 2Gi
      hostname: ecs-realtime
      networks:
      - multus:
          networkName: mec-nets/attachnet1
        name: attachnet1
      volumes:
      - name: bootdisk
        persistentVolumeClaim:
          claimName: ecs-realtime-bootpvc-hlie0p
      - cloudInitConfigDrive:
          userData: |-
            #cloud-config
            user: root
            password: atomic
            ssh_pwauth: True
            chpasswd: { expire: False }
        name: cloudinitdisk
Prerequisites
- Kubernetes (kubelet) must enable the CPUManager so that KubeVirt VMs can get dedicated (pinned) CPUs. Strictly speaking, real-time and CPU pinning are independent features, but real-time behaves much better on pinned CPUs, so at the product level KubeVirt requires dedicated CPU placement for real-time VMs.
  See Kubernetes CPUManager: https://kubernetes.io/docs/tasks/administer-cluster/cpu-management-policies/
- The VM definition must enable NUMA topology mapping with guestMappingPassthrough: as above, a product-level KubeVirt restriction.
  See KubeVirt NUMA: https://kubevirt.io/user-guide/virtual_machines/numa/
- The CPU model must be host-passthrough, so the guest sees the host CPU directly with no capabilities masked. This gives the best performance, but severely limits live-migration compatibility: the VM can only migrate to nodes with an identical host CPU.
  See libvirt domain host-passthrough: https://libvirt.org/formatdomain.html
- The VM definition must configure hugepages, which in turn is a prerequisite for NUMA topology mapping.
  See KubeVirt hugepages: https://kubevirt.io/user-guide/virtual_machines/virtual_hardware/#hugepages
- The node must allow processes (threads) to run with the SCHED_FIFO policy: check and set the kernel.sched_rt_runtime_us kernel parameter on the node:

$ sysctl kernel.sched_rt_runtime_us
kernel.sched_rt_runtime_us = 950000   # default value
$ sysctl -w kernel.sched_rt_runtime_us=-1

See Linux real-time group scheduling: https://www.kernel.org/doc/Documentation/scheduler/sched-rt-group.txt
Note: the default 950000 (0.95 s out of the default 1 s period) leaves 0.05 s for non-real-time (SCHED_OTHER) tasks, so that a runaway real-time task cannot lock up the machine and there is time left to recover. Setting it to -1 removes this safety margin and can render the system unresponsive!
virt-handler checks the kernel.sched_rt_runtime_us kernel parameter on its node and adds a kubevirt.io/realtime label to nodes with the expected value (-1):
func (n *NodeLabeller) prepareLabels(node *v1.Node, cpuModels []string, cpuFeatures cpuFeatures, hostCpuModel hostCPUModel, obsoleteCPUsx86 map[string]bool) map[string]string {
	// a lot of code here
	capable, err := isNodeRealtimeCapable()
	if err != nil {
		n.logger.Reason(err).Error("failed to identify if a node is capable of running realtime workloads")
	}
	if capable {
		newLabels[kubevirtv1.RealtimeLabel] = ""
	}
	// a lot of code here
}
// https://github.com/kubevirt/kubevirt/blob/04a198e5a33cd1369e534f55b26920dce7776f69/pkg/virt-handler/node-labeller/node_labeller.go#L367-L381
const kernelSchedRealtimeRuntimeInMicrosecods = "kernel.sched_rt_runtime_us"

func isNodeRealtimeCapable() (bool, error) {
	ret, err := exec.Command("sysctl", kernelSchedRealtimeRuntimeInMicrosecods).CombinedOutput()
	if err != nil {
		return false, err
	}
	st := strings.Trim(string(ret), "\n")
	return fmt.Sprintf("%s = -1", kernelSchedRealtimeRuntimeInMicrosecods) == st, nil
}
As you can see, virt-handler inspects the node's kernel parameter simply by running the sysctl kernel.sched_rt_runtime_us command.
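For comparison, the same check can be done without shelling out, by reading procfs directly; kernel.sched_rt_runtime_us maps to /proc/sys/kernel/sched_rt_runtime_us. A minimal sketch of that alternative (my own illustration, not KubeVirt's code):

package main

import (
	"fmt"
	"os"
	"strings"
)

// isNodeRealtimeCapableProcfs mirrors isNodeRealtimeCapable but reads the
// sysctl value from procfs instead of invoking the sysctl binary.
func isNodeRealtimeCapableProcfs() (bool, error) {
	b, err := os.ReadFile("/proc/sys/kernel/sched_rt_runtime_us")
	if err != nil {
		return false, err
	}
	return strings.TrimSpace(string(b)) == "-1", nil
}

func main() {
	capable, err := isNodeRealtimeCapableProcfs()
	fmt.Println(capable, err)
}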
When virt-controller generates the virt-launcher Pod for a real-time VM, it adds a NodeSelector so that the Pod only gets scheduled onto nodes carrying the kubevirt.io/realtime label:
if newVMI.IsRealtimeEnabled() {
	log.Log.V(4).Info("Add realtime node label selector")
	addNodeSelector(newVMI, v1.RealtimeLabel)
}
$ kubectl get po virt-launcher-ecs-realtime-wt2k8 -o jsonpath='{.spec.nodeSelector}' | jq
{
  "cpumanager": "true",
  "kubernetes.io/arch": "amd64",
  "kubevirt.io/realtime": "",
  "kubevirt.io/schedulable": "true"
}
$ kubectl get nodes -l "kubevirt.io/realtime"
NAME STATUS ROLES AGE VERSION
node164 Ready control-plane,worker 226d v1.27.2
How It Works
According to the libvirt real-time documentation, real-time is configured by defining CPU tuning fields on the VM (the libvirt domain), for example:
<cputune>
  <emulatorpin cpuset="8-9"/>
  <vcpupin vcpu="0" cpuset="12"/>
  <vcpupin vcpu="1" cpuset="13"/>
  <vcpupin vcpu="2" cpuset="14"/>
  <vcpupin vcpu="3" cpuset="15"/>
  <vcpusched vcpus='0-3' scheduler='fifo' priority='1'/>
</cputune>
- emulatorpin pins the emulator's own threads (here, qemu-kvm itself) to logical CPUs
- vcpupin pins the VM's vCPUs (the emulator's compute threads) to logical CPUs
- vcpusched sets the scheduling policy and priority for the VM's vCPU (compute) threads

See libvirt CPU tuning: https://libvirt.org/formatdomain.html#cpu-tuning
Inspect the VM process (qemu-kvm) inside the virt-launcher Pod associated with the KubeVirt VM:
$ kubectl exec -it virt-launcher-ecs-realtime-wt2k8 -- ps -ef
UID PID PPID C STIME TTY TIME CMD
qemu 1 0 0 09:09 ? 00:00:00 /usr/bin/virt-launcher-monitor
qemu 12 1 0 09:09 ? 00:00:07 /usr/bin/virt-launcher --qemu-
qemu 19 12 0 09:09 ? 00:00:08 /usr/sbin/virtqemud -f /var/ru
qemu 30 12 0 09:09 ? 00:00:01 /usr/sbin/virtlogd -f /etc/lib
qemu 71 1 0 09:09 ? 00:00:54 /usr/libexec/qemu-kvm -name gu
$ kubectl exec -it virt-launcher-ecs-realtime-wt2k8 -- ps -p 71 -L
PID LWP TTY TIME CMD
71 71 ? 00:00:05 qemu-kvm
71 73 ? 00:00:00 qemu-kvm
71 74 ? 00:00:00 TC tc-ram-node0
71 75 ? 00:00:00 IO iothread1
71 78 ? 00:00:05 IO mon_iothread
71 79 ? 00:00:30 CPU 0/KVM
71 80 ? 00:00:12 CPU 1/KVM
71 82 ? 00:00:00 vnc_worker
Every thread that is not a compute (vCPU) or IO thread counts as the emulator's own.
And dump its libvirt domain definition:
$ kubectl exec -it virt-launcher-ecs-realtime-wt2k8 -- virsh dumpxml 1
<domain type='kvm' id='1'>
  <cputune>
    <vcpupin vcpu='0' cpuset='4'/>
    <vcpupin vcpu='1' cpuset='16'/>
  </cputune>
</domain>
Unlike the CPU tuning example above, however, there is no vcpusched element. The reason is that virtqemud running inside the virt-launcher Pod (KubeVirt replaced libvirtd with virtqemud starting in v0.59.0) does not have enough privileges to adjust the scheduling policy and priority of the qemu-kvm compute threads as the domain XML would require.
Compare the scheduling policy and priority of the compute threads of the real-time qemu-kvm (ecs-realtime) with those of a regular qemu-kvm:
# Real-Time VM
$ kubectl exec -it virt-launcher-ecs-realtime-wt2k8 -- chrt -p 79
pid 79's current scheduling policy: SCHED_FIFO
pid 79's current scheduling priority: 1
$ kubectl exec -it virt-launcher-ecs-realtime-wt2k8 -- chrt -p 80
pid 80's current scheduling policy: SCHED_FIFO
pid 80's current scheduling priority: 1
# Regular VM
$ kubectl exec -it virt-launcher-ecs-test10-b5rtv -- chrt -p 76
pid 76's current scheduling policy: SCHED_OTHER
pid 76's current scheduling priority: 0
$ kubectl exec -it virt-launcher-ecs-test10-b5rtv -- chrt -p 77
pid 77's current scheduling policy: SCHED_OTHER
pid 77's current scheduling priority: 0
This confirms that a KubeVirt VM defined with realtime: {} really does run its compute threads with the SCHED_FIFO policy at priority 1, while a regular VM's compute threads use SCHED_OTHER.
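chrt itself relies on the sched_getscheduler(2) family of system calls. To run the same check programmatically, here is a minimal sketch of my own (not KubeVirt code; policy constants per sched(7)):

package main

import (
	"fmt"
	"os"
	"strconv"

	"golang.org/x/sys/unix"
)

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: getsched <tid>")
		os.Exit(1)
	}
	tid, _ := strconv.Atoi(os.Args[1])
	// sched_getscheduler returns the policy:
	// 0 = SCHED_OTHER, 1 = SCHED_FIFO, 2 = SCHED_RR.
	policy, _, errno := unix.Syscall(unix.SYS_SCHED_GETSCHEDULER, uintptr(tid), 0, 0)
	if errno != 0 {
		fmt.Fprintln(os.Stderr, errno)
		os.Exit(1)
	}
	fmt.Printf("tid %d scheduling policy: %d\n", tid, policy)
}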
The conclusion up front: KubeVirt has virt-handler, rather than libvirtd, adjust the compute threads' scheduling policy and priority in the Linux scheduler, and virt-handler runs privileged:
$ kubectl get ds virt-handler -n kubevirt -o jsonpath='{.spec.template.spec.containers[0].securityContext}' | jq
{
  "privileged": true,
  "seLinuxOptions": {
    "level": "s0"
  }
}
Kubernetes SecurityContext 参考 https://kubernetes.io/docs/tasks/configure-pod-container/security-context/
Next, let's look at how virt-handler does this: https://github.com/kubevirt/kubevirt/blob/v1.0.0/pkg/virt-handler/realtime.go#L37-L70:
// configureRealTimeVCPUs parses the realtime mask value and configured the selected vcpus
// for real time workloads by setting the scheduler to FIFO and process priority equal to 1.
func (d *VirtualMachineController) configureVCPUScheduler(vmi *v1.VirtualMachineInstance) error {
	res, err := d.podIsolationDetector.Detect(vmi)
	if err != nil {
		return err
	}
	qemuProcess, err := res.GetQEMUProcess()
	if err != nil {
		return err
	}
	vcpus, err := getVCPUThreadIDs(qemuProcess.Pid())
	if err != nil {
		return err
	}
	mask, err := parseCPUMask(vmi.Spec.Domain.CPU.Realtime.Mask)
	if err != nil {
		return err
	}
	for vcpuID, threadID := range vcpus {
		if mask.isEnabled(vcpuID) {
			param := schedParam{priority: 1}
			tid, err := strconv.Atoi(threadID)
			if err != nil {
				return err
			}
			err = schedSetScheduler(tid, schedFIFO, param)
			if err != nil {
				return fmt.Errorf("failed to set FIFO scheduling and priority 1 for thread %d: %w", tid, err)
			}
		}
	}
	return nil
}
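Before stepping through the function, a note on the mask that parseCPUMask consumes: per the KubeVirt user guide it uses cpuset-style notation, e.g. "0-3,^1" selects vCPUs 0, 2 and 3. A rough standalone sketch of those semantics (my own illustration, not KubeVirt's actual parser):

package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseMask expands a cpuset-style mask ("0-3,^1") into the set of
// enabled vCPU IDs. Exclusions ("^N") are assumed to apply to IDs
// already added by earlier tokens.
func parseMask(mask string) (map[int]bool, error) {
	enabled := map[int]bool{}
	for _, tok := range strings.Split(mask, ",") {
		tok = strings.TrimSpace(tok)
		switch {
		case strings.HasPrefix(tok, "^"): // exclusion
			id, err := strconv.Atoi(tok[1:])
			if err != nil {
				return nil, err
			}
			delete(enabled, id)
		case strings.Contains(tok, "-"): // range
			parts := strings.SplitN(tok, "-", 2)
			lo, err := strconv.Atoi(parts[0])
			if err != nil {
				return nil, err
			}
			hi, err := strconv.Atoi(parts[1])
			if err != nil {
				return nil, err
			}
			for i := lo; i <= hi; i++ {
				enabled[i] = true
			}
		default: // single vCPU ID
			id, err := strconv.Atoi(tok)
			if err != nil {
				return nil, err
			}
			enabled[id] = true
		}
	}
	return enabled, nil
}

func main() {
	m, _ := parseMask("0-3,^1")
	fmt.Println(m) // map[0:true 2:true 3:true]
}

Now, step by step through configureVCPUScheduler: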
- virt-handler first obtains the PID of the qemu-kvm process backing the VMI.
- It then calls getVCPUThreadIDs, which reads /proc/${qemu-pid}/task to find qemu-kvm's compute threads (https://github.com/kubevirt/kubevirt/blob/v1.0.0/pkg/virt-handler/realtime.go#L79-L99):

func isVCPU(comm []byte) (string, bool) {
	if !vcpuRegex.MatchString(string(comm)) {
		return "", false
	}
	v := vcpuRegex.FindSubmatch(comm)
	return string(v[1]), true
}

func getVCPUThreadIDs(pid int) (map[string]string, error) {
	p := filepath.Join(string(os.PathSeparator), "proc", strconv.Itoa(pid), "task")
	d, err := os.ReadDir(p)
	if err != nil {
		return nil, err
	}
	ret := map[string]string{}
	for _, f := range d {
		if f.IsDir() {
			c, err := os.ReadFile(filepath.Join(p, f.Name(), "comm"))
			if err != nil {
				return nil, err
			}
			if v, ok := isVCPU(c); ok {
				ret[v] = f.Name()
			}
		}
	}
	return ret, nil
}

A thread whose comm file matches the regular expression ^CPU (\d+)/KVM\n$ is a compute thread; this is essentially what we saw earlier with ps -L -p:

$ ll /proc/592916/task
total 0
dr-xr-xr-x 7 qemu qemu 0 Apr 18 05:14 592916
dr-xr-xr-x 7 qemu qemu 0 Apr 18 05:14 592920
dr-xr-xr-x 7 qemu qemu 0 Apr 18 05:14 592925
dr-xr-xr-x 7 qemu qemu 0 Apr 18 05:14 592926
dr-xr-xr-x 7 qemu qemu 0 Apr 18 05:14 592953
dr-xr-xr-x 7 qemu qemu 0 Apr 18 05:14 592954
dr-xr-xr-x 7 qemu qemu 0 Apr 18 05:14 592955
dr-xr-xr-x 7 qemu qemu 0 Apr 18 05:14 592958
$ cat /proc/592916/task/592954/comm
CPU 0/KVM
$ cat /proc/592916/task/592955/comm
CPU 1/KVM
$ ps -L -p 592916
    PID     LWP TTY          TIME CMD
 592916  592916 ?        00:00:30 qemu-kvm
 592916  592920 ?        00:00:00 qemu-kvm
 592916  592925 ?        00:00:00 TC tc-ram-node0
 592916  592926 ?        00:00:00 IO iothread1
 592916  592953 ?        00:00:45 IO mon_iothread
 592916  592954 ?        00:00:45 CPU 0/KVM
 592916  592955 ?        00:00:37 CPU 1/KVM
 592916  592958 ?        00:00:00 vnc_worker

See the /proc API: https://man7.org/linux/man-pages/man5/proc.5.html

/proc/pid/comm (since Linux 2.6.33)
This file exposes the process's comm value—that is, the command name associated with the process. Different threads in the same process may have different comm values, accessible via /proc/pid/task/tid/comm. A thread may modify its comm value, or that of any other thread in the same thread group (see the discussion of CLONE_THREAD in clone(2)), by writing to the file /proc/self/task/tid/comm.
- Finally it calls schedSetScheduler to set each compute thread's scheduling policy to SCHED_FIFO and its priority to 1 (https://github.com/kubevirt/kubevirt/blob/v1.0.0/pkg/virt-handler/setsched.go#L24-L39):

const (
	// schedFIFO represents the Linux SCHED_FIFO scheduling policy ID:
	//
	// #define SCHED_FIFO 1
	//
	// Ref: https://github.com/torvalds/linux/blob/c2bf05db6c78f53ca5cd4b48f3b9b71f78d215f1/include/uapi/linux/sched.h#L115
	schedFIFO policy = 1
)

func schedSetScheduler(pid int, policy policy, param schedParam) error {
	_, _, e1 := unix.Syscall(unix.SYS_SCHED_SETSCHEDULER, uintptr(pid), uintptr(policy), uintptr(unsafe.Pointer(&param)))
	if e1 != 0 {
		return e1
	}
	return nil
}

This is done through the sched_setscheduler system call; Go's standard library does not wrap it, so the code falls back to the Syscall function from the golang.org/x/sys/unix package. See sched_setscheduler: https://man7.org/linux/man-pages/man2/sched_setscheduler.2.html
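To see why privilege matters here, try the same system call on yourself: without CAP_SYS_NICE (or a non-zero RLIMIT_RTPRIO) sched_setscheduler fails with EPERM, which is exactly the wall that the unprivileged virtqemud in the Pod runs into. A minimal demo sketch of my own (not KubeVirt code):

package main

import (
	"fmt"
	"unsafe"

	"golang.org/x/sys/unix"
)

// sched_param as the kernel expects it: a single C int.
type schedParam struct{ priority int32 }

const schedFIFO = 1 // SCHED_FIFO policy ID

func main() {
	param := schedParam{priority: 1}
	// pid 0 means "the calling thread".
	_, _, errno := unix.Syscall(unix.SYS_SCHED_SETSCHEDULER,
		0, schedFIFO, uintptr(unsafe.Pointer(&param)))
	if errno != 0 {
		// EPERM when lacking CAP_SYS_NICE / RLIMIT_RTPRIO.
		fmt.Println("sched_setscheduler failed:", errno)
		return
	}
	fmt.Println("now running as SCHED_FIFO priority 1")
}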
virt-handler additionally sets the scheduling policy and priority of the KVM PIT thread, see https://github.com/kubevirt/kubevirt/blob/a8b752c2f2a3152f69b1faf2bb6af258fae7337c/pkg/virt-handler/vm.go#L2713-L2763:
KVM PIT emulates the i8254 PIT (Programmable Interval Timer) device for the VM, providing the guest with a timer. From the KVM API docs (https://www.kernel.org/doc/Documentation/virtual/kvm/api.txt):

PIT timer interrupts may use a per-VM kernel thread for injection. If it exists, this thread will have a name of the following pattern:

kvm-pit/<owner-process-pid>

When running a guest with elevated priorities, the scheduling parameters of this thread may have to be adjusted accordingly.
func (d *VirtualMachineController) affinePitThread(vmi *v1.VirtualMachineInstance) error {
	res, err := d.podIsolationDetector.Detect(vmi)
	if err != nil {
		return err
	}
	// a lot of code here
	pitpid, err := res.KvmPitPid()
	if err != nil {
		return err
	}
	if pitpid == -1 {
		return nil
	}
	if vmi.IsRealtimeEnabled() {
		param := schedParam{priority: 2}
		err = schedSetScheduler(pitpid, schedFIFO, param)
		if err != nil {
			return fmt.Errorf("failed to set FIFO scheduling and priority 2 for thread %d: %w", pitpid, err)
		}
	}
	// a lot of code here
}
As with the compute threads, this goes through the sched_setscheduler system call, setting the policy to SCHED_FIFO but the priority to 2 (under SCHED_FIFO a larger number means higher priority, so the PIT thread actually outranks the vCPU threads). virt-handler obtains the KVM PIT thread of the qemu-kvm process by calling the KvmPitPid method:
https://github.com/kubevirt/kubevirt/blob/v1.0.0/pkg/virt-handler/isolation/isolation.go#L198-L216
func (r *RealIsolationResult) KvmPitPid() (int, error) {
	qemuprocess, err := r.GetQEMUProcess()
	if err != nil {
		return -1, err
	}
	processes, _ := ps.Processes()
	nspid, err := GetNspid(qemuprocess.Pid())
	if err != nil || nspid == -1 {
		return -1, err
	}
	pitstr := "kvm-pit/" + strconv.Itoa(nspid)
	for _, process := range processes {
		if process.Executable() == pitstr {
			return process.Pid(), nil
		}
	}
	return -1, nil
}
This part is fairly brute force: it walks every process on the host and matches the executable name against kvm-pit/${qemu-kvm-nspid} to locate the KVM PIT kernel thread. Let's reproduce it by hand:
$ kubectl exec -it virt-launcher-ecs-realtime-wt2k8 -- ps -ef | grep qemu-kvm
qemu 71 1 0 Apr17 ? 00:03:09 /usr/libexec/qemu-kvm -name gu
$ ps -ef | grep "kvm-pit/71"
root 592956 2 0 Apr17 ? 00:00:00 [kvm-pit/71]
$ chrt -p 592956
pid 592956's current scheduling policy: SCHED_FIFO
pid 592956's current scheduling priority: 2
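The same lookup can be reproduced in plain Go by scanning /proc/*/comm, a sketch equivalent in spirit to KvmPitPid (which relies on a ps helper package instead):

package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// findKvmPit scans /proc for a kernel thread named kvm-pit/<nspid>,
// where nspid is the qemu-kvm PID as seen inside its own pid namespace.
func findKvmPit(qemuNspid int) (int, error) {
	want := "kvm-pit/" + strconv.Itoa(qemuNspid)
	entries, err := os.ReadDir("/proc")
	if err != nil {
		return -1, err
	}
	for _, e := range entries {
		pid, err := strconv.Atoi(e.Name())
		if err != nil {
			continue // not a process directory
		}
		comm, err := os.ReadFile(filepath.Join("/proc", e.Name(), "comm"))
		if err != nil {
			continue // the process may have exited meanwhile
		}
		if strings.TrimSpace(string(comm)) == want {
			return pid, nil
		}
	}
	return -1, nil
}

func main() {
	fmt.Println(findKvmPit(71))
}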
Because the virtqemud process (running as the qemu user) inside the virt-launcher Pod cannot modify the scheduling policy and priority of the qemu-kvm compute threads and the KVM PIT thread, KubeVirt takes a detour: the privileged virt-handler daemon does what virtqemud would otherwise have done.
One final note: real-time VMs also need a real-time kernel. The default kernel of almost every Linux distribution is unsuitable for low-latency real-time workloads, so either replace the kernel or pick a distribution built for real-time.