KubeVirt UnexpectedAdmissionError Explained
Apr 12, 2024 00:30 · 3289 words · 7 minute read
In one sentence: after the host of a KubeVirt VirtualMachineInstance is rebooted without shutting the VM down first, the associated virt-launcher Pod can get stuck in the UnexpectedAdmissionError state and never be deleted, so the VM does not restart as expected.
$ kubectl get po -w
NAME READY STATUS RESTARTS AGE
virt-launcher-ecs-test0-s5pgv 1/1 Running 0 13m
# reboot
virt-launcher-ecs-test0-s5pgv 0/1 UnexpectedAdmissionError 0 17m
How KubeVirt Handles a Host Reboot
A KubeVirt virtual machine instance (VMI) runs inside a virt-launcher Pod. Powering off the node that hosts the Pod kills the qemu-kvm process outright, which is the equivalent of pulling the plug on the VM.
When the NodeController inside virt-controller notices that the node is unhealthy (its heartbeat has timed out), it updates every VMI on that node to the Failed state:
https://github.com/kubevirt/kubevirt/blob/v1.0.0/pkg/virt-controller/watch/node.go
When the VMController inside virt-controller sees a VMI in the Failed state, it calls stopVMI to delete it (because of finalizers, the VMI first receives a deletionTimestamp and lives on for a while):
https://github.com/kubevirt/kubevirt/blob/v1.0.0/pkg/virt-controller/watch/vm.go
When the VMIController inside virt-controller sees that the VMI is being deleted (its deletionTimestamp is set), it calls deleteAllMatchingPods to delete the virt-launcher Pod it manages:
https://github.com/kubevirt/kubevirt/blob/v1.0.0/pkg/virt-controller/watch/vmi.go
For brevity, the source code is not quoted here.
So after a host reboot, KubeVirt first deletes all of the node's existing VMIs and virt-launcher Pods, then recreates the VMIs and virt-launcher Pods to restart the VMs. If a VMI (and its virt-launcher Pod) gets stuck in the deleting state, virt-controller cannot recreate the VMI, because each VirtualMachine maps to exactly one VirtualMachineInstance, and the VM restart is delayed.
The Device Plugin Framework
By default, a virt-launcher Pod created by virt-controller requests three extra "resources":
$ kubectl get po virt-launcher-ecs-test0-w8srf -o jsonpath='{.spec.containers[0].resources}' | jq
{
"limits": {
"cpu": "2",
"devices.kubevirt.io/kvm": "1",
"devices.kubevirt.io/tun": "1",
"devices.kubevirt.io/vhost-net": "1",
"memory": "2302Mi"
},
"requests": {
"cpu": "2",
"devices.kubevirt.io/kvm": "1",
"devices.kubevirt.io/tun": "1",
"devices.kubevirt.io/vhost-net": "1",
"ephemeral-storage": "50M",
"memory": "2302Mi"
}
}
- devices.kubevirt.io/kvm
- devices.kubevirt.io/tun
- devices.kubevirt.io/vhost-net
Now look at the node it runs on:
$ kubectl get node mec52 -o jsonpath='{.status.capacity}' | jq
{
"cpu": "16",
"devices.kubevirt.io/kvm": "1k",
"devices.kubevirt.io/tun": "1k",
"devices.kubevirt.io/vhost-net": "1k",
"devices.kubevirt.io/vhost-vsock": "1k",
"ephemeral-storage": "82743276Ki",
"hugepages-1Gi": "0",
"hugepages-2Mi": "0",
"memory": "32651168Ki",
"pods": "110"
}
These "resources" also appear in the node status because KubeVirt uses the Kubernetes Device Plugin framework: it registers devices on the node, and the Pod spec requests those devices, so virt-launcher Pods are scheduled only onto nodes where the resources are available. Since the device plugin server can tell whether its node supports virtualization and registers the device resources only on nodes that have kvm, virt-launcher Pods are guaranteed not to land on nodes without virtualization support.
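For orientation, a device plugin is just a gRPC service that implements the kubelet device plugin API. Below is a minimal, hedged skeleton, not KubeVirt's code: the dummyPlugin type and package name are hypothetical, and the import assumes the upstream k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1 package (KubeVirt vendors its own copy of this API):
package dummyplugin

import (
	"context"

	pluginapi "k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1"
)

// dummyPlugin is a hypothetical, minimal plugin used only to illustrate the framework.
type dummyPlugin struct {
	devs []*pluginapi.Device // virtual device IDs advertised to kubelet
}

// Compile-time check: a device plugin is anything that satisfies this gRPC service.
var _ pluginapi.DevicePluginServer = (*dummyPlugin)(nil)

func (p *dummyPlugin) GetDevicePluginOptions(context.Context, *pluginapi.Empty) (*pluginapi.DevicePluginOptions, error) {
	return &pluginapi.DevicePluginOptions{}, nil
}

// ListAndWatch streams the device list (and later health changes) to kubelet;
// this is what makes the resource show up in the node's capacity.
func (p *dummyPlugin) ListAndWatch(_ *pluginapi.Empty, s pluginapi.DevicePlugin_ListAndWatchServer) error {
	return s.Send(&pluginapi.ListAndWatchResponse{Devices: p.devs})
}

func (p *dummyPlugin) GetPreferredAllocation(context.Context, *pluginapi.PreferredAllocationRequest) (*pluginapi.PreferredAllocationResponse, error) {
	return &pluginapi.PreferredAllocationResponse{}, nil
}

// Allocate is called by kubelet at Pod admission time; stubbed here, a more
// realistic body is sketched further down in this article.
func (p *dummyPlugin) Allocate(context.Context, *pluginapi.AllocateRequest) (*pluginapi.AllocateResponse, error) {
	return &pluginapi.AllocateResponse{}, nil
}

func (p *dummyPlugin) PreStartContainer(context.Context, *pluginapi.PreStartContainerRequest) (*pluginapi.PreStartContainerResponse, error) {
	return &pluginapi.PreStartContainerResponse{}, nil
}
KubeVirt's GenericDevicePlugin, quoted below, is essentially a production-grade version of this for kvm, tun, and vhost-net.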
The device plugin server is built into virt-handler (which is deployed as a DaemonSet), see https://github.com/kubevirt/kubevirt/blob/v1.0.0/pkg/virt-handler/device-manager/device_controller.go:
func PermanentHostDevicePlugins(maxDevices int, permissions string) []Device {
var permanentDevicePluginPaths = map[string]string{
"kvm": "/dev/kvm",
"tun": "/dev/net/tun",
"vhost-net": "/dev/vhost-net",
}
ret := make([]Device, 0, len(permanentDevicePluginPaths))
for name, path := range permanentDevicePluginPaths {
ret = append(ret, NewGenericDevicePlugin(name, path, maxDevices, permissions, (name != "kvm")))
}
return ret
}
These three device plugins register their device resources with kubelet over gRPC, see https://github.com/kubevirt/kubevirt/blob/v1.0.0/pkg/virt-handler/device-manager/generic_device.go:
func NewGenericDevicePlugin(deviceName string, devicePath string, maxDevices int, permissions string, preOpen bool) *GenericDevicePlugin {
serverSock := SocketPath(deviceName)
dpi := &GenericDevicePlugin{
// a lot of code here
}
for i := 0; i < maxDevices; i++ {
deviceId := dpi.deviceName + strconv.Itoa(i)
dpi.devs = append(dpi.devs, &pluginapi.Device{
ID: deviceId,
Health: pluginapi.Healthy,
})
}
return dpi
}
maxDevices defaults to 1000, which is the "1k" quantity shown in the node status.
A device plugin server is essentially a gRPC server reachable over a Unix domain socket:
// Start starts the device plugin
func (dpi *GenericDevicePlugin) Start(stop <-chan struct{}) (err error) {
// a lot of code here
sock, err := net.Listen("unix", dpi.socketPath)
if err != nil {
return fmt.Errorf("error creating GRPC server socket: %v", err)
}
dpi.server = grpc.NewServer([]grpc.ServerOption{}...)
defer dpi.stopDevicePlugin()
pluginapi.RegisterDevicePluginServer(dpi.server, dpi)
errChan := make(chan error, 2)
go func() {
errChan <- dpi.server.Serve(sock)
}()
err = waitForGRPCServer(dpi.socketPath, connectionTimeout)
if err != nil {
return fmt.Errorf("error starting the GRPC server: %v", err)
}
err = dpi.register()
if err != nil {
return fmt.Errorf("error registering with device plugin manager: %v", err)
}
	// a lot of code here
}
The device plugin server then calls the Registration API to send kubelet a device plugin registration request, telling it the endpoint (Unix domain socket) path:
// Register registers the device plugin for the given resourceName with Kubelet.
func (dpi *GenericDevicePlugin) register() error {
// a lot of code here
client := pluginapi.NewRegistrationClient(conn)
reqt := &pluginapi.RegisterRequest{
Version: pluginapi.Version,
Endpoint: path.Base(dpi.socketPath),
ResourceName: dpi.resourceName,
}
_, err = client.Register(context.Background(), reqt)
if err != nil {
return err
}
return nil
}
Following the Device Plugin framework, the device plugin server's Unix domain socket must be created under /var/lib/kubelet/device-plugins so that kubelet can reach it, which is why the virt-handler DaemonSet Pod mounts the host's /var/lib/kubelet/device-plugins directory into the container (its manifest is omitted here):
$ ll /var/lib/kubelet/device-plugins
total 52
srwxr-xr-x 1 root root 0 Apr 7 15:12 kubelet.sock
-rw------- 1 root root 50395 Apr 10 18:13 kubelet_internal_checkpoint
srwxr-xr-x 1 root root 0 Apr 7 15:14 kubevirt-kvm.sock
srwxr-xr-x 1 root root 0 Apr 7 15:14 kubevirt-tun.sock
srwxr-xr-x 1 root root 0 Apr 7 15:14 kubevirt-vhost-net.sock
srwxr-xr-x 1 root root 0 Apr 7 15:14 kubevirt-vhost-vsock.sock
kvm, tun, and vhost-net are served by three separate device plugin servers.
The device plugin server implements the ListAndWatch handler to "send" its healthy devices to kubelet:
func (dpi *GenericDevicePlugin) ListAndWatch(e *pluginapi.Empty, s pluginapi.DevicePlugin_ListAndWatchServer) error {
s.Send(&pluginapi.ListAndWatchResponse{Devices: dpi.devs})
done := false
for {
select {
case devHealth := <-dpi.health:
// There's only one shared generic device
// so update each plugin device to reflect overall device health
for _, dev := range dpi.devs {
dev.Health = devHealth.Health
}
s.Send(&pluginapi.ListAndWatchResponse{Devices: dpi.devs})
}
// a lot of code here
	}
}
For details on implementing a device plugin, see https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/#device-plugin-implementation
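KubeVirt's Allocate handler is not quoted above, but since Allocate is exactly the step that fails in the next section, here is a hedged sketch of what such a handler typically returns for a host device, continuing the hypothetical dummyPlugin from earlier (now assumed to carry a devicePath field, e.g. /dev/kvm). It replaces the stub above and is not KubeVirt's actual implementation:
// Allocate tells kubelet, for every container in the request, which host
// device node to expose inside the container and with what permissions.
func (p *dummyPlugin) Allocate(_ context.Context, r *pluginapi.AllocateRequest) (*pluginapi.AllocateResponse, error) {
	resp := &pluginapi.AllocateResponse{}
	for range r.ContainerRequests {
		resp.ContainerResponses = append(resp.ContainerResponses, &pluginapi.ContainerAllocateResponse{
			Devices: []*pluginapi.DeviceSpec{{
				HostPath:      p.devicePath, // host device node, e.g. /dev/kvm (hypothetical field)
				ContainerPath: p.devicePath, // same path inside the container
				Permissions:   "rw",
			}},
		})
	}
	return resp, nil
}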
UnexpectedAdmissionError
After the node reboots, kubectl describe on the virt-launcher Pod shows:
$ kubectl -n ns-5gc describe pod virt-launcher-ecs-smf-tbx8p
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning UnexpectedAdmissionError 32m kubelet Allocate failed due to no healthy devices present; cannot allocate unhealthy devices devices.kubevirt.io/vhost-net, which is unexpected
Warning UnexpectedAdmissionError 20m kubelet Allocate failed due to no healthy devices present; cannot allocate unhealthy devices devices.kubevirt.io/vhost-net, which is unexpected
Warning UnexpectedAdmissionError 12m kubelet Allocate failed due to no healthy devices present; cannot allocate unhealthy devices devices.kubevirt.io/vhost-net, which is unexpected
The Pod stays stuck in the deleting state for a long time (this happens with some probability, not on every reboot), and the warning events are emitted by kubelet, see https://github.com/kubernetes/kubernetes/blob/v1.27.2/pkg/kubelet/cm/devicemanager/manager.go#L556-L559:
// Check if registered resource has healthy devices
if healthyDevices.Len() == 0 {
return nil, fmt.Errorf("no healthy devices present; cannot allocate unhealthy devices %s", resource)
}
This happens because the healthyDevices map inside kubelet's device manager contains no devices for devices.kubevirt.io/vhost-net.
Two situations empty the device sets in the healthyDevices map:
- kubelet restarts
// Reads device to container allocation information from disk, and populates
// m.allocatedDevices accordingly.
func (m *ManagerImpl) readCheckpoint() error {
	cp, err := m.getCheckpointV2()
	// a lot of code here
	m.mutex.Lock()
	defer m.mutex.Unlock()
	podDevices, registeredDevs := cp.GetDataInLatestFormat()
	m.podDevices.fromCheckpointData(podDevices)
	m.allocatedDevices = m.podDevices.devices()
	for resource := range registeredDevs {
		// During start up, creates empty healthyDevices list so that the resource capacity
		// will stay zero till the corresponding device plugin re-registers.
		m.healthyDevices[resource] = sets.NewString() // empty healthyDevices list
		m.unhealthyDevices[resource] = sets.NewString()
		m.endpoints[resource] = endpointInfo{e: newStoppedEndpointImpl(resource), opts: nil}
	}
	return nil
}
After a restart, kubelet reads the registered resources back from its checkpoint file but initializes each of them with an empty device set, which leaves healthyDevices.Len() == 0 until the corresponding device plugin re-registers.
- the device plugin server disconnects from kubelet
// PluginDisconnected is to disconnect a plugin from an endpoint.
// This is done as part of device plugin deregistration.
func (m *ManagerImpl) PluginDisconnected(resourceName string) {
	m.mutex.Lock()
	defer m.mutex.Unlock()
	if _, exists := m.endpoints[resourceName]; exists {
		m.markResourceUnhealthy(resourceName)
		klog.V(2).InfoS("Endpoint became unhealthy", "resourceName", resourceName, "endpoint", m.endpoints[resourceName])
	}
	m.endpoints[resourceName].e.setStopTime(time.Now())
}

func (m *ManagerImpl) markResourceUnhealthy(resourceName string) {
	klog.V(2).InfoS("Mark all resources Unhealthy for resource", "resourceName", resourceName)
	healthyDevices := sets.NewString()
	if _, ok := m.healthyDevices[resourceName]; ok {
		healthyDevices = m.healthyDevices[resourceName]
		m.healthyDevices[resourceName] = sets.NewString() // empty healthyDevices list
	}
	if _, ok := m.unhealthyDevices[resourceName]; !ok {
		m.unhealthyDevices[resourceName] = sets.NewString()
	}
	m.unhealthyDevices[resourceName] = m.unhealthyDevices[resourceName].Union(healthyDevices)
}
Check the virt-handler logs on this node:
$ kubectl logs virt-handler-j5hvw -n kubevirt | grep "device plugin"
Defaulted container "virt-handler" out of: virt-handler, virt-launcher (init)
{"component":"virt-handler","level":"info","msg":"Starting a device plugin for device: kvm","pos":"device_controller.go:58","timestamp":"2024-04-11T09:28:21.610308Z"}
{"component":"virt-handler","level":"info","msg":"Starting a device plugin for device: tun","pos":"device_controller.go:58","timestamp":"2024-04-11T09:28:21.610366Z"}
{"component":"virt-handler","level":"info","msg":"Starting a device plugin for device: vhost-net","pos":"device_controller.go:58","timestamp":"2024-04-11T09:28:21.610534Z"}
{"component":"virt-handler","level":"info","msg":"Starting a device plugin for device: vhost-vsock","pos":"device_controller.go:58","timestamp":"2024-04-11T09:28:21.610948Z"}
{"component":"virt-handler","level":"info","msg":"refreshed device plugins for permitted/forbidden host devices","pos":"device_controller.go:345","timestamp":"2024-04-11T09:28:21.610973Z"}
{"component":"virt-handler","level":"info","msg":"refreshed device plugins for permitted/forbidden host devices","pos":"device_controller.go:345","timestamp":"2024-04-11T09:28:21.612061Z"}
{"component":"virt-handler","level":"info","msg":"kvm device plugin started","pos":"generic_device.go:161","timestamp":"2024-04-11T09:28:21.614518Z"}
{"component":"virt-handler","level":"info","msg":"tun device plugin started","pos":"generic_device.go:161","timestamp":"2024-04-11T09:28:21.629623Z"}
{"component":"virt-handler","level":"info","msg":"vhost-vsock device plugin started","pos":"generic_device.go:161","timestamp":"2024-04-11T09:28:21.629999Z"}
{"component":"virt-handler","level":"info","msg":"vhost-net device plugin started","pos":"generic_device.go:161","timestamp":"2024-04-11T09:28:21.646687Z"}
This confirms that the device plugin servers started successfully.
Now check the kubelet logs on the same node:
$ journalctl -u kubelet -r | grep "device plugin"
Apr 11 17:28:21 mec52 kubelet[3886]: I0411 17:28:21.645680 3886 server.go:144] "Got registration request from device plugin with resource" resourceName="devices.kubevirt.io/vhost-net"
Apr 11 17:28:21 mec52 kubelet[3886]: I0411 17:28:21.628903 3886 server.go:144] "Got registration request from device plugin with resource" resourceName="devices.kubevirt.io/vhost-vsock"
Apr 11 17:28:21 mec52 kubelet[3886]: I0411 17:28:21.628065 3886 server.go:144] "Got registration request from device plugin with resource" resourceName="devices.kubevirt.io/tun"
Apr 11 17:28:21 mec52 kubelet[3886]: I0411 17:28:21.612774 3886 server.go:144] "Got registration request from device plugin with resource" resourceName="devices.kubevirt.io/kvm"
The device plugins registered with kubelet at 17:28:21. Next, confirm when the UnexpectedAdmissionError occurred:
$ kubectl get events --field-selector reason=UnexpectedAdmissionError -o yaml | grep -i timestamp
creationTimestamp: "2024-04-11T09:26:17Z"
The last time kubelet tried to delete the virt-launcher Pod was at 09:26:17Z (17:26:17 local time).
Putting the whole sequence together: after the node reboots, kubelet first tries to delete the virt-launcher Pod, but at that point the virt-handler Pod has not yet started (or has not finished starting), meaning the vhost-net and other device plugins have not yet registered their devices with kubelet; kubelet therefore fails to delete the virt-launcher Pod and emits the UnexpectedAdmissionError event. About two minutes later the device plugins do register with kubelet, but by then kubelet has stopped retrying, so the Pod stays stuck in the deleting state.
Because virt-handler is deployed as a DaemonSet, there is no way to guarantee that the virt-handler Pod is up before kubelet tries to delete the virt-launcher Pod.
Pod Garbage Collection
$ kubectl get po virt-launcher-ecs-test0-w8srf -o jsonpath='{.status}' | jq
{
"conditions": [
{
"lastProbeTime": "2024-04-10T10:13:18Z",
"lastTransitionTime": "2024-04-10T10:13:18Z",
"message": "the virtual machine is not paused",
"reason": "NotPaused",
"status": "True",
"type": "kubevirt.io/virtual-machine-unpaused"
}
],
"message": "Pod was rejected: Allocate failed due to no healthy devices present; cannot allocate unhealthy devices devices.kubevirt.io/kvm, which is unexpected",
"phase": "Failed",
"reason": "UnexpectedAdmissionError",
"startTime": "2024-04-10T10:13:19Z"
}
Kubernetes does not reclaim Pods in the Failed phase on its own; they remain until a human or a program explicitly intervenes. Pod garbage collection (PodGC) only cleans up Pods that meet one of the following conditions:
- orphaned Pods bound to a node that no longer exists
- unscheduled Pods that are terminating
- terminating Pods bound to a node that carries the node.kubernetes.io/out-of-service taint, provided the NodeOutOfServiceVolumeDetach feature gate is enabled
See https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-garbage-collection
The virt-launcher Pod meets none of these conditions, so it stays stuck indefinitely and the VM cannot be restarted.
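To check whether a cluster is affected, one option is to list virt-launcher Pods that were rejected at admission. Here is a minimal client-go sketch (not from the article; the kubevirt.io=virt-launcher label selector and the local kubeconfig path are assumptions). It only reports the Pods and deliberately does not delete anything, for reasons explained in the next section:
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load the local kubeconfig (assumption: running outside the cluster).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	// virt-launcher Pods carry the kubevirt.io=virt-launcher label (assumption).
	pods, err := client.CoreV1().Pods(metav1.NamespaceAll).List(context.TODO(), metav1.ListOptions{
		LabelSelector: "kubevirt.io=virt-launcher",
	})
	if err != nil {
		panic(err)
	}

	// Report Pods that were rejected at admission; do not force-delete them.
	for _, p := range pods.Items {
		if p.Status.Phase == corev1.PodFailed && p.Status.Reason == "UnexpectedAdmissionError" {
			fmt.Printf("%s/%s: %s\n", p.Namespace, p.Name, p.Status.Message)
		}
	}
}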
Solution
Force-deleting the Failed virt-launcher Pod with kubectl delete pod --force, or patching virt-controller to do the same, is the wrong approach: a forced deletion skips the CNI plugin's resource cleanup (kube-ovn, for example), so the restarted VM, especially one rescheduled onto another node, ends up with networking problems caused by conflicting underlying network resources.
Since the virt-handler DaemonSet is managed by virt-operator and we want to keep that deployment model, one workable solution for now is to drop the device resources from the virt-launcher Pod:
resources:
requests:
cpu: "2"
# devices.kubevirt.io/kvm: "1"
# devices.kubevirt.io/tun: "1"
# devices.kubevirt.io/vhost-net: "1"
memory: "2302Mi"
limits:
cpu: "2"
# devices.kubevirt.io/kvm: "1"
# devices.kubevirt.io/tun: "1"
# devices.kubevirt.io/vhost-net: "1"
memory: "2302Mi"
Scheduling is then steered with node affinity instead, so that virt-launcher Pods avoid nodes without virtualization support (a sketch of such an affinity follows the watch output below). After the node is shut down and rebooted, the virt-launcher Pod is deleted by kubelet just like any ordinary Pod, and virt-controller can recreate both the VMI and the virt-launcher Pod:
$ kubectl get po -w
NAME READY STATUS RESTARTS AGE
virt-launcher-ecs-test5-5glqn 1/1 Running 0 10m
# reboot
virt-launcher-ecs-test5-5glqn 0/1 Running 0 10m
virt-launcher-ecs-test5-5glqn 0/1 NodeAffinity 0 15m
virt-launcher-ecs-test5-5glqn 0/1 Terminating 0 15m
virt-launcher-ecs-test5-5glqn 0/1 Terminating 0 15m
virt-launcher-ecs-test5-b8njw 0/1 Pending 0 0s
virt-launcher-ecs-test5-b8njw 0/1 Pending 0 0s
virt-launcher-ecs-test5-b8njw 0/1 Pending 0 1s
virt-launcher-ecs-test5-b8njw 0/1 Pending 0 1s
virt-launcher-ecs-test5-b8njw 0/1 ContainerCreating 0 1s
virt-launcher-ecs-test5-b8njw 0/1 ContainerCreating 0 1s
virt-launcher-ecs-test5-b8njw 0/1 ContainerCreating 0 9s
virt-launcher-ecs-test5-b8njw 0/1 ContainerCreating 0 9s
virt-launcher-ecs-test5-b8njw 1/1 Running 0 1s
virt-launcher-ecs-test5-b8njw 1/1 Running 0 1s
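For reference, the node affinity mentioned above could look roughly like the following Go sketch using the corev1 types; the example.com/virtualization label key is a hypothetical placeholder, and the real key depends on how virtualization-capable nodes are labeled in your cluster:
package example

import corev1 "k8s.io/api/core/v1"

// virtNodeAffinity returns a required node affinity that keeps virt-launcher
// Pods on nodes labeled as virtualization-capable. The label key and value
// are placeholders, not something KubeVirt sets on its own.
func virtNodeAffinity() *corev1.Affinity {
	return &corev1.Affinity{
		NodeAffinity: &corev1.NodeAffinity{
			RequiredDuringSchedulingIgnoredDuringExecution: &corev1.NodeSelector{
				NodeSelectorTerms: []corev1.NodeSelectorTerm{{
					MatchExpressions: []corev1.NodeSelectorRequirement{{
						Key:      "example.com/virtualization",
						Operator: corev1.NodeSelectorOpIn,
						Values:   []string{"true"},
					}},
				}},
			},
		},
	}
}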
Some Thoughts
The root cause of the problem above is that the virt-launcher Pod effectively depends on the virt-handler component, and specifically on virt-handler's device plugins having successfully registered their device resources with kubelet. Kubernetes has no good way to express this kind of Pod (process) dependency: we cannot make kubelet wait until a device plugin has registered its resources before it starts managing certain Pods' lifecycles, because the two processes run in parallel. We cannot even use systemd to start virt-handler before kubelet, because the Device Plugin framework needs kubelet's Unix domain socket in order to register. With systemd we could treat the appearance of kubelet's Unix domain socket as the trigger for starting the device plugin server, so that it registers its device resources as early as possible, but that only lowers the probability of hitting the problem; it does not eliminate it.