The Mechanics of KubeVirt's UnexpectedAdmissionError

Apr 12, 2024 00:30 · 3289 words · 7 minute read KubeVirt Kubernetes

The symptom in one sentence: when the host node of a KubeVirt VirtualMachineInstance reboots (without the VM being shut down first), the associated virt-launcher Pod may get stuck in the UnexpectedAdmissionError state and cannot be deleted normally, which prevents the VM from restarting as expected.

$ kubectl get po -w
NAME                            READY   STATUS                     RESTARTS   AGE
virt-launcher-ecs-test0-s5pgv   1/1     Running                    0          13m
# reboot
virt-launcher-ecs-test0-s5pgv   0/1     UnexpectedAdmissionError   0          17m

How KubeVirt Handles a Host Reboot

A KubeVirt virtual machine instance (VMI) runs inside a virt-launcher Pod. Powering off the node that hosts the Pod kills the qemu-kvm process outright, which is equivalent to cutting the VM's power.

When the NodeController inside virt-controller notices that a node is unhealthy (its heartbeat timed out), it marks every VMI on that node as Failed:

https://github.com/kubevirt/kubevirt/blob/v1.0.0/pkg/virt-controller/watch/node.go

When the VMController inside virt-controller sees a VMI in the Failed state, it calls the stopVMI method to delete it (because of the finalizer, the VMI first gets a deletionTimestamp and lingers for a while):

https://github.com/kubevirt/kubevirt/blob/v1.0.0/pkg/virt-controller/watch/vm.go

When the VMIController inside virt-controller sees that a VMI is being deleted (its deletionTimestamp is set), it calls the deleteAllMatchingPods method to delete the virt-launcher Pod it manages:

https://github.com/kubevirt/kubevirt/blob/v1.0.0/pkg/virt-controller/watch/vmi.go

For brevity, the source snippets are not reproduced here.

So after a host reboot, KubeVirt first deletes all the VMIs and virt-launcher Pods that were on that node, and then restarts the VMs by recreating the VMIs and virt-launcher Pods. If a VMI (and its virt-launcher Pod) gets stuck in the deleting state, virt-controller cannot recreate the VMI, because a VirtualMachine maps to exactly one VirtualMachineInstance, and the VM restart is delayed.

The Device Plugin Framework

The virt-launcher Pod created by virt-controller requests three extra "resources" by default:

$ kubectl get po virt-launcher-ecs-test0-w8srf -o jsonpath='{.spec.containers[0].resources}' | jq
{
  "limits": {
    "cpu": "2",
    "devices.kubevirt.io/kvm": "1",
    "devices.kubevirt.io/tun": "1",
    "devices.kubevirt.io/vhost-net": "1",
    "memory": "2302Mi"
  },
  "requests": {
    "cpu": "2",
    "devices.kubevirt.io/kvm": "1",
    "devices.kubevirt.io/tun": "1",
    "devices.kubevirt.io/vhost-net": "1",
    "ephemeral-storage": "50M",
    "memory": "2302Mi"
  }
}
  • devices.kubevirt.io/kvm
  • devices.kubevirt.io/tun
  • devices.kubevirt.io/vhost-net

Now look at the node it runs on:

$ kubectl get node mec52 -o jsonpath='{.status.capacity}' | jq
{
  "cpu": "16",
  "devices.kubevirt.io/kvm": "1k",
  "devices.kubevirt.io/tun": "1k",
  "devices.kubevirt.io/vhost-net": "1k",
  "devices.kubevirt.io/vhost-vsock": "1k",
  "ephemeral-storage": "82743276Ki",
  "hugepages-1Gi": "0",
  "hugepages-2Mi": "0",
  "memory": "32651168Ki",
  "pods": "110"
}

These "resources" also appear in the node status because KubeVirt uses the Kubernetes Device Plugin framework: it registers devices with the node, and the Pod spec requests those devices, so virt-launcher Pods only get scheduled onto nodes where the resources are available. Because the device plugin server can tell whether its node supports virtualization and only registers device resources on nodes that actually have /dev/kvm, virt-launcher Pods are guaranteed not to be scheduled onto nodes without virtualization support.

The device plugin server is built into virt-handler (which is deployed as a DaemonSet): https://github.com/kubevirt/kubevirt/blob/v1.0.0/pkg/virt-handler/device-manager/device_controller.go

func PermanentHostDevicePlugins(maxDevices int, permissions string) []Device {
    var permanentDevicePluginPaths = map[string]string{
        "kvm":       "/dev/kvm",
        "tun":       "/dev/net/tun",
        "vhost-net": "/dev/vhost-net",
    }

    ret := make([]Device, 0, len(permanentDevicePluginPaths))
    for name, path := range permanentDevicePluginPaths {
        ret = append(ret, NewGenericDevicePlugin(name, path, maxDevices, permissions, (name != "kvm")))
    }
    return ret
}

These three device plugins register their device resources with the kubelet over gRPC: https://github.com/kubevirt/kubevirt/blob/v1.0.0/pkg/virt-handler/device-manager/generic_device.go

func NewGenericDevicePlugin(deviceName string, devicePath string, maxDevices int, permissions string, preOpen bool) *GenericDevicePlugin {
    serverSock := SocketPath(deviceName)
    dpi := &GenericDevicePlugin{
        // a lot of code here
    }

    for i := 0; i < maxDevices; i++ {
        deviceId := dpi.deviceName + strconv.Itoa(i)
        dpi.devs = append(dpi.devs, &pluginapi.Device{
            ID:     deviceId,
            Health: pluginapi.Healthy,
        })
    }

    return dpi
}

maxDevices defaults to 1000, which is exactly the "1k" count shown in the node status.

A device plugin server is essentially a gRPC server reached over a Unix Domain Socket:

// Start starts the device plugin
func (dpi *GenericDevicePlugin) Start(stop <-chan struct{}) (err error) {
    // a lot of code here
    sock, err := net.Listen("unix", dpi.socketPath)
    if err != nil {
        return fmt.Errorf("error creating GRPC server socket: %v", err)
    }

    dpi.server = grpc.NewServer([]grpc.ServerOption{}...)
    defer dpi.stopDevicePlugin()

    pluginapi.RegisterDevicePluginServer(dpi.server, dpi)

    errChan := make(chan error, 2)

    go func() {
        errChan <- dpi.server.Serve(sock)
    }()

    err = waitForGRPCServer(dpi.socketPath, connectionTimeout)
    if err != nil {
        return fmt.Errorf("error starting the GRPC server: %v", err)
    }

    err = dpi.register()
    if err != nil {
        return fmt.Errorf("error registering with device plugin manager: %v", err)
    }
    // a lot of code here
}

The device plugin server only needs to call the Registration API to send a device plugin registration request to the kubelet, telling it the endpoint (Unix Domain Socket) path:

// Register registers the device plugin for the given resourceName with Kubelet.
func (dpi *GenericDevicePlugin) register() error {
    // a lot of code here
    client := pluginapi.NewRegistrationClient(conn)
    reqt := &pluginapi.RegisterRequest{
        Version:      pluginapi.Version,
        Endpoint:     path.Base(dpi.socketPath),
        ResourceName: dpi.resourceName,
    }

    _, err = client.Register(context.Background(), reqt)
    if err != nil {
        return err
    }
    return nil
}

Following the Device Plugin framework, the device plugin server's Unix Domain Socket must be created under /var/lib/kubelet/device-plugins so that the kubelet can reach it, which is why the virt-handler DaemonSet Pod mounts the host's /var/lib/kubelet/device-plugins directory into the container (the full manifest is not reproduced here; a sketch of the mount follows the listing below):

$ ll /var/lib/kubelet/device-plugins
total 52
srwxr-xr-x 1 root root     0 Apr  7 15:12 kubelet.sock
-rw------- 1 root root 50395 Apr 10 18:13 kubelet_internal_checkpoint
srwxr-xr-x 1 root root     0 Apr  7 15:14 kubevirt-kvm.sock
srwxr-xr-x 1 root root     0 Apr  7 15:14 kubevirt-tun.sock
srwxr-xr-x 1 root root     0 Apr  7 15:14 kubevirt-vhost-net.sock
srwxr-xr-x 1 root root     0 Apr  7 15:14 kubevirt-vhost-vsock.sock

kvm, tun, and vhost-net are each served by a separate device plugin server.
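
For illustration, the relevant part of such a DaemonSet looks roughly like the sketch below; the names, labels, and image tag here are assumptions for the example, not copied from the actual virt-handler manifest:

# Illustrative sketch only: expose the kubelet's device-plugin directory to the
# container through a hostPath volume, so the plugin's Unix Domain Socket is
# created where the kubelet expects it.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: virt-handler
  namespace: kubevirt
spec:
  selector:
    matchLabels:
      kubevirt.io: virt-handler
  template:
    metadata:
      labels:
        kubevirt.io: virt-handler
    spec:
      containers:
      - name: virt-handler
        image: quay.io/kubevirt/virt-handler:v1.0.0
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins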

In the device plugin server, the ListAndWatch handler "sends" the list of healthy devices to the kubelet:

func (dpi *GenericDevicePlugin) ListAndWatch(e *pluginapi.Empty, s pluginapi.DevicePlugin_ListAndWatchServer) error {
    s.Send(&pluginapi.ListAndWatchResponse{Devices: dpi.devs})

    done := false
    for {
        select {
        case devHealth := <-dpi.health:
            // There's only one shared generic device
            // so update each plugin device to reflect overall device health
            for _, dev := range dpi.devs {
                dev.Health = devHealth.Health
            }
            s.Send(&pluginapi.ListAndWatchResponse{Devices: dpi.devs})
        // a lot of code here (stop handling sets done to true)
        }
        if done {
            break
        }
    }
    // a lot of code here
}

For the Device Plugin implementation contract, see https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/#device-plugin-implementation

UnexpectedAdmissionError

After the node reboots, kubectl describe on the virt-launcher Pod shows:

$ kubectl -n ns-5gc describe pod virt-launcher-ecs-smf-tbx8p
Events:
  Type     Reason                    Age   From     Message
  ----     ------                    ----  ----     -------
  Warning  UnexpectedAdmissionError  32m   kubelet  Allocate failed due to no healthy devices present; cannot allocate unhealthy devices devices.kubevirt.io/vhost-net, which is unexpected
  Warning  UnexpectedAdmissionError  20m   kubelet  Allocate failed due to no healthy devices present; cannot allocate unhealthy devices devices.kubevirt.io/vhost-net, which is unexpected
  Warning  UnexpectedAdmissionError  12m   kubelet  Allocate failed due to no healthy devices present; cannot allocate unhealthy devices devices.kubevirt.io/vhost-net, which is unexpected

The Pod stays stuck in the deleting state for a long time (with a certain probability; it does not happen on every reboot), and the warning events are emitted by the kubelet: https://github.com/kubernetes/kubernetes/blob/v1.27.2/pkg/kubelet/cm/devicemanager/manager.go#L556-L559

    // Check if registered resource has healthy devices
    if healthyDevices.Len() == 0 {
        return nil, fmt.Errorf("no healthy devices present; cannot allocate unhealthy devices %s", resource)
    }

This happens because the devices.kubevirt.io/vhost-net device is missing from the healthyDevices map of the kubelet's internal device manager.

The devices in the healthyDevices map are cleared in the following two situations:

  1. The kubelet restarts

    https://github.com/kubernetes/kubernetes/blob/v1.27.2/pkg/kubelet/cm/devicemanager/manager.go#L444-L484

    // Reads device to container allocation information from disk, and populates
    // m.allocatedDevices accordingly.
    func (m *ManagerImpl) readCheckpoint() error {
        cp, err := m.getCheckpointV2()
        // a lot of code here
    
        m.mutex.Lock()
        defer m.mutex.Unlock()
        podDevices, registeredDevs := cp.GetDataInLatestFormat()
        m.podDevices.fromCheckpointData(podDevices)
        m.allocatedDevices = m.podDevices.devices()
        for resource := range registeredDevs {
            // During start up, creates empty healthyDevices list so that the resource capacity
            // will stay zero till the corresponding device plugin re-registers.
            m.healthyDevices[resource] = sets.NewString() // empty healthyDevices list
            m.unhealthyDevices[resource] = sets.NewString()
            m.endpoints[resource] = endpointInfo{e: newStoppedEndpointImpl(resource), opts: nil}
        }
        return nil
    }
    

    After a restart, the kubelet reads the device resources back from the checkpoint file but creates an empty healthyDevices set for each of them, which makes healthyDevices.Len() == 0 until the corresponding device plugin re-registers.

  2. The device plugin server disconnects

    https://github.com/kubernetes/kubernetes/blob/v1.27.2/pkg/kubelet/cm/devicemanager/manager.go#L222-L234

    // PluginDisconnected is to disconnect a plugin from an endpoint.
    // This is done as part of device plugin deregistration.
    func (m *ManagerImpl) PluginDisconnected(resourceName string) {
        m.mutex.Lock()
        defer m.mutex.Unlock()
    
        if _, exists := m.endpoints[resourceName]; exists {
            m.markResourceUnhealthy(resourceName)
            klog.V(2).InfoS("Endpoint became unhealthy", "resourceName", resourceName, "endpoint", m.endpoints[resourceName])
        }
    
        m.endpoints[resourceName].e.setStopTime(time.Now())
    }
    
    func (m *ManagerImpl) markResourceUnhealthy(resourceName string) {
        klog.V(2).InfoS("Mark all resources Unhealthy for resource", "resourceName", resourceName)
        healthyDevices := sets.NewString()
        if _, ok := m.healthyDevices[resourceName]; ok {
            healthyDevices = m.healthyDevices[resourceName]
            m.healthyDevices[resourceName] = sets.NewString() // empty healthyDevices list
        }
        if _, ok := m.unhealthyDevices[resourceName]; !ok {
            m.unhealthyDevices[resourceName] = sets.NewString()
        }
        m.unhealthyDevices[resourceName] = m.unhealthyDevices[resourceName].Union(healthyDevices)
    }
    

Check the virt-handler logs on this node:

$ kubectl logs virt-handler-j5hvw -n kubevirt | grep "device plugin"
Defaulted container "virt-handler" out of: virt-handler, virt-launcher (init)
{"component":"virt-handler","level":"info","msg":"Starting a device plugin for device: kvm","pos":"device_controller.go:58","timestamp":"2024-04-11T09:28:21.610308Z"}
{"component":"virt-handler","level":"info","msg":"Starting a device plugin for device: tun","pos":"device_controller.go:58","timestamp":"2024-04-11T09:28:21.610366Z"}
{"component":"virt-handler","level":"info","msg":"Starting a device plugin for device: vhost-net","pos":"device_controller.go:58","timestamp":"2024-04-11T09:28:21.610534Z"}
{"component":"virt-handler","level":"info","msg":"Starting a device plugin for device: vhost-vsock","pos":"device_controller.go:58","timestamp":"2024-04-11T09:28:21.610948Z"}
{"component":"virt-handler","level":"info","msg":"refreshed device plugins for permitted/forbidden host devices","pos":"device_controller.go:345","timestamp":"2024-04-11T09:28:21.610973Z"}
{"component":"virt-handler","level":"info","msg":"refreshed device plugins for permitted/forbidden host devices","pos":"device_controller.go:345","timestamp":"2024-04-11T09:28:21.612061Z"}
{"component":"virt-handler","level":"info","msg":"kvm device plugin started","pos":"generic_device.go:161","timestamp":"2024-04-11T09:28:21.614518Z"}
{"component":"virt-handler","level":"info","msg":"tun device plugin started","pos":"generic_device.go:161","timestamp":"2024-04-11T09:28:21.629623Z"}
{"component":"virt-handler","level":"info","msg":"vhost-vsock device plugin started","pos":"generic_device.go:161","timestamp":"2024-04-11T09:28:21.629999Z"}
{"component":"virt-handler","level":"info","msg":"vhost-net device plugin started","pos":"generic_device.go:161","timestamp":"2024-04-11T09:28:21.646687Z"}

This confirms that the device plugin servers started successfully.

Now look at the kubelet logs on the same node:

$ journalctl -u kubelet -r | grep "device plugin"
Apr 11 17:28:21 mec52 kubelet[3886]: I0411 17:28:21.645680    3886 server.go:144] "Got registration request from device plugin with resource" resourceName="devices.kubevirt.io/vhost-net"
Apr 11 17:28:21 mec52 kubelet[3886]: I0411 17:28:21.628903    3886 server.go:144] "Got registration request from device plugin with resource" resourceName="devices.kubevirt.io/vhost-vsock"
Apr 11 17:28:21 mec52 kubelet[3886]: I0411 17:28:21.628065    3886 server.go:144] "Got registration request from device plugin with resource" resourceName="devices.kubevirt.io/tun"
Apr 11 17:28:21 mec52 kubelet[3886]: I0411 17:28:21.612774    3886 server.go:144] "Got registration request from device plugin with resource" resourceName="devices.kubevirt.io/kvm"

The device plugins registered with the kubelet at 17:28:21. Next, check when the UnexpectedAdmissionError occurred:

$ kubectl get events --field-selector reason=UnexpectedAdmissionError -o yaml | grep -i timestamp
    creationTimestamp: "2024-04-11T09:26:17Z"

So the kubelet's last attempt to delete the virt-launcher Pod was at 09:26:17Z (17:26:17 local time), roughly two minutes before the device plugins registered.

Putting the whole sequence together: after the node reboots, the kubelet first tries to delete the virt-launcher Pod, but at that moment the virt-handler Pod has not yet started (or finished starting), which means the vhost-net and other device plugins have not yet sent (registered) their device resources to the kubelet; the kubelet therefore fails to delete the virt-launcher Pod and emits the UnexpectedAdmissionError event. About two minutes later the device plugins do register with the kubelet, but by then the kubelet no longer retries the deletion, so the Pod stays stuck in the deleting state.

Because virt-handler is deployed as a DaemonSet, there is no way to guarantee that the virt-handler Pod starts before the kubelet deletes the virt-launcher Pod.

Pod Garbage Collection

$ kubectl get po virt-launcher-ecs-test0-w8srf -o jsonpath='{.status}' | jq
{
  "conditions": [
    {
      "lastProbeTime": "2024-04-10T10:13:18Z",
      "lastTransitionTime": "2024-04-10T10:13:18Z",
      "message": "the virtual machine is not paused",
      "reason": "NotPaused",
      "status": "True",
      "type": "kubevirt.io/virtual-machine-unpaused"
    }
  ],
  "message": "Pod was rejected: Allocate failed due to no healthy devices present; cannot allocate unhealthy devices devices.kubevirt.io/kvm, which is unexpected",
  "phase": "Failed",
  "reason": "UnexpectedAdmissionError",
  "startTime": "2024-04-10T10:13:19Z"
}

Kubernetes does not reclaim Pods in the Failed phase on its own; they remain until a human or a controller explicitly intervenes. Pod garbage collection (PodGC) only cleans up Pods that meet one of the following conditions:

  • Orphaned Pods whose bound node no longer exists
  • Unscheduled Pods that are in the process of terminating
  • Terminating Pods bound to a node carrying the node.kubernetes.io/out-of-service taint, provided the NodeOutOfServiceVolumeDetach feature gate is enabled (see the illustrative taint below)

See https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-garbage-collection
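
For reference, the third condition is what the non-graceful node shutdown feature relies on: an administrator (or an external controller) taints the node that is known to be down, roughly as in the sketch below. This is only an illustration of the PodGC condition, not something KubeVirt does here:

# Illustrative sketch only: the out-of-service taint that allows PodGC to clean
# up terminating Pods on a node that is known to be down (requires the
# NodeOutOfServiceVolumeDetach feature gate).
apiVersion: v1
kind: Node
metadata:
  name: mec52
spec:
  taints:
  - key: node.kubernetes.io/out-of-service
    value: nodeshutdown
    effect: NoExecute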

The virt-launcher Pod matches none of these conditions, so it stays stuck indefinitely and the VM cannot be restarted.

Solution

Using kubectl delete pod --force, or patching virt-controller, to forcibly delete the Failed virt-launcher Pod is the wrong approach: a forced deletion skips the CNI plugin's (e.g. kube-ovn's) resource cleanup, so the restarted VM (especially one rescheduled onto another node) may end up with broken networking due to conflicting underlying network resources.

Since the virt-handler DaemonSet is managed by the virt-operator component and we want to keep that deployment model, one feasible solution for now is to drop the device resources from the virt-launcher Pod:

resources:
  requests:
    cpu: "2"
    # devices.kubevirt.io/kvm: "1"
    # devices.kubevirt.io/tun: "1"
    # devices.kubevirt.io/vhost-net: "1"
    memory: "2302Mi"
  limits:
    cpu: "2"
    # devices.kubevirt.io/kvm: "1"
    # devices.kubevirt.io/tun: "1"
    # devices.kubevirt.io/vhost-net: "1"
    memory: "2302Mi"

Scheduling is then influenced through node affinity instead, so that virt-launcher Pods are not placed on nodes without virtualization support (a sketch of such an affinity rule follows the output below). After the node is shut down and rebooted, the virt-launcher Pod is deleted by the kubelet just like an ordinary Pod, and virt-controller can recreate the VMI and the virt-launcher Pod:

$ kubectl get po -w
NAME                            READY   STATUS                     RESTARTS   AGE
virt-launcher-ecs-test5-5glqn   1/1     Running                    0          10m
# reboot
virt-launcher-ecs-test5-5glqn   0/1     Running                    0          10m
virt-launcher-ecs-test5-5glqn   0/1     NodeAffinity               0          15m
virt-launcher-ecs-test5-5glqn   0/1     Terminating                0          15m
virt-launcher-ecs-test5-5glqn   0/1     Terminating                0          15m
virt-launcher-ecs-test5-b8njw   0/1     Pending                    0          0s
virt-launcher-ecs-test5-b8njw   0/1     Pending                    0          0s
virt-launcher-ecs-test5-b8njw   0/1     Pending                    0          1s
virt-launcher-ecs-test5-b8njw   0/1     Pending                    0          1s
virt-launcher-ecs-test5-b8njw   0/1     ContainerCreating          0          1s
virt-launcher-ecs-test5-b8njw   0/1     ContainerCreating          0          1s
virt-launcher-ecs-test5-b8njw   0/1     ContainerCreating          0          9s
virt-launcher-ecs-test5-b8njw   0/1     ContainerCreating          0          9s
virt-launcher-ecs-test5-b8njw   1/1     Running                    0          1s
virt-launcher-ecs-test5-b8njw   1/1     Running                    0          1s
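
The node affinity mentioned above could look roughly like the sketch below, added to the virt-launcher Pod template. The label key node.example.com/virtualization and its value are hypothetical; any label that the cluster administrator applies to virtualization-capable nodes works the same way:

# Illustrative sketch only: restrict virt-launcher Pods to nodes carrying a
# label that marks them as virtualization-capable (label key/value are
# hypothetical and must be applied to the right nodes by the administrator).
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: node.example.com/virtualization
          operator: In
          values:
          - "supported"

With the nodes labeled accordingly, scheduling behaves much like the device-plugin-based placement, without depending on virt-handler having registered its devices first.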

Some Thoughts

The root cause of all of the above is that the virt-launcher Pod in fact depends on the virt-handler component, and more precisely on the device plugins inside virt-handler having successfully registered their device resources with the kubelet. Kubernetes has no good way to express this kind of Pod (process) dependency: we cannot make the kubelet manage the lifecycle of certain Pods only after the device plugins have registered their resources, because the two processes run in parallel. We cannot even use systemd to start virt-handler before the kubelet, because the Device Plugin framework needs the kubelet's Unix Domain Socket to register itself. With systemd we could at best treat the appearance of the kubelet's Unix Domain Socket as the trigger for starting the device plugin server, so that it registers its device resources as early as possible, but that only lowers the probability of hitting the problem; it does not eliminate it.