Kubernetes Node Lifecycle Controller

Feb 9, 2025 · Kubernetes, Golang

The OpenYurt project provides edge autonomy: when the cloud-edge network is disconnected and an edge node or its workload containers restart, the workload containers recover automatically on the edge node instead of being evicted and rescheduled by the cloud-side kube-controller-manager. Enabling this feature requires turning off the node-lifecycle-controller in kube-controller-manager (add --controllers=-node-lifecycle-controller to its startup arguments) and enabling the node-lifecycle-controller implemented in YurtManager, which carries the business logic behind edge autonomy.

kube-controller-manager bundles many controllers; among them, node-lifecycle-controller is the one that watches node health and evicts Pods when a node becomes unhealthy.

Node Heartbeats

In a distributed system, a periodic "heartbeat" is the usual way to tell whether something is still alive. Heartbeats can be implemented in many ways, for example by periodically renewing or re-acquiring a lease. For Kubernetes nodes, the heartbeat is the kubelet periodically updating the Ready entry in the conditions list of the Node status:

$ k get node vtester0 -o json | jq '.status.conditions[] | select(.type == "Ready")'
{
  "lastHeartbeatTime": "2025-02-02T15:32:41Z",
  "lastTransitionTime": "2024-12-31T08:54:57Z",
  "message": "kubelet is posting ready status",
  "reason": "KubeletReady",
  "status": "True",
  "type": "Ready"
}
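
Besides kubectl, the heartbeat can also be read programmatically. Below is a minimal client-go sketch (the clientset cs, the node name, and the helper name are placeholders) that prints the Ready condition together with the node's Lease in the kube-node-lease namespace, which the kubelet renews as a lighter-weight heartbeat and which the controller source quoted later also checks:

import (
    "context"
    "fmt"

    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

// printHeartbeat prints the two heartbeat signals the control plane looks at:
// the Ready condition in node.status and the Lease object in kube-node-lease.
func printHeartbeat(ctx context.Context, cs kubernetes.Interface, nodeName string) error {
    node, err := cs.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
    if err != nil {
        return err
    }
    for _, cond := range node.Status.Conditions {
        if cond.Type == corev1.NodeReady {
            fmt.Printf("Ready=%v lastHeartbeatTime=%v\n", cond.Status, cond.LastHeartbeatTime)
        }
    }

    // The kubelet also renews a Lease named after the node (every ~10s by default).
    lease, err := cs.CoordinationV1().Leases("kube-node-lease").Get(ctx, nodeName, metav1.GetOptions{})
    if err != nil {
        return err
    }
    fmt.Printf("lease renewTime=%v\n", lease.Spec.RenewTime)
    return nil
}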

By default the kubelet reports node status to the API server every 5 minutes (nodeStatusReportFrequency); for backward compatibility, if nodeStatusUpdateFrequency is set explicitly in /var/lib/kubelet/config.yaml, the report frequency defaults to that value instead. The defaulting logic is nested fairly deep; read that part of the source for the details.

https://github.com/kubernetes/kubernetes/blob/1649f592f1909b97aa3c2a0a8f968a3fd05a7b8b/pkg/kubelet/apis/config/v1beta1/defaults.go#L55-L267

func SetDefaults_KubeletConfiguration(obj *kubeletconfigv1beta1.KubeletConfiguration) {
    // a lot of code here
    if obj.NodeStatusReportFrequency == zeroDuration {
        // For backward compatibility, NodeStatusReportFrequency's default value is
        // set to NodeStatusUpdateFrequency if NodeStatusUpdateFrequency is set
        // explicitly.
        if obj.NodeStatusUpdateFrequency == zeroDuration {
            obj.NodeStatusReportFrequency = metav1.Duration{Duration: 5 * time.Minute}
        } else {
            obj.NodeStatusReportFrequency = obj.NodeStatusUpdateFrequency
        }
    }
}
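
If you do want to tune these frequencies, they belong in the kubelet config file mentioned above. The sketch below only illustrates where the two fields live, rendered from the public kubelet config types; the durations are arbitrary examples, not recommendations:

import (
    "fmt"
    "time"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    kubeletconfigv1beta1 "k8s.io/kubelet/config/v1beta1"
    "sigs.k8s.io/yaml"
)

// printExampleKubeletConfig renders a KubeletConfiguration with the two
// heartbeat-related fields set to example values.
func printExampleKubeletConfig() error {
    cfg := kubeletconfigv1beta1.KubeletConfiguration{
        TypeMeta: metav1.TypeMeta{
            APIVersion: "kubelet.config.k8s.io/v1beta1",
            Kind:       "KubeletConfiguration",
        },
        NodeStatusUpdateFrequency: metav1.Duration{Duration: 10 * time.Second},
        NodeStatusReportFrequency: metav1.Duration{Duration: time.Minute},
    }
    out, err := yaml.Marshal(cfg)
    if err != nil {
        return err
    }
    // The two fields sit alongside everything else in /var/lib/kubelet/config.yaml.
    fmt.Println(string(out))
    return nil
}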

When the kubelet dies for whatever reason and stops refreshing the Ready entry in the conditions list, kubectl get node shows the node as NotReady. At this point we find the node has been given two taints, and every entry in status.conditions except NetworkUnavailable has been updated with reason NodeStatusUnknown:

$ kubectl get node vtester1 -w
NAME       STATUS   ROLES    AGE   VERSION
vtester1   Ready    <none>   38d   v1.26.10
vtester1   NotReady   <none>   38d   v1.26.10
vtester1   NotReady   <none>   38d   v1.26.10
vtester1   NotReady   <none>   38d   v1.26.10


$ kubectl get node vtester1 -o yaml | yq .spec.taints
- effect: NoSchedule
  key: node.kubernetes.io/unreachable
  timeAdded: "2025-02-07T15:44:35Z"
- effect: NoExecute
  key: node.kubernetes.io/unreachable
  timeAdded: "2025-02-07T15:44:40Z"

$ kubectl get node vtester1 -o yaml | yq .status.conditions
- lastHeartbeatTime: "2025-02-05T03:55:06Z"
  lastTransitionTime: "2025-02-05T03:55:06Z"
  message: Flannel is running on this node
  reason: FlannelIsUp
  status: "False"
  type: NetworkUnavailable
- lastHeartbeatTime: "2025-02-07T15:46:10Z"
  lastTransitionTime: "2025-02-07T15:51:10Z"
  message: Kubelet stopped posting node status.
  reason: NodeStatusUnknown
  status: Unknown
  type: MemoryPressure
- lastHeartbeatTime: "2025-02-07T15:46:10Z"
  lastTransitionTime: "2025-02-07T15:51:10Z"
  message: Kubelet stopped posting node status.
  reason: NodeStatusUnknown
  status: Unknown
  type: DiskPressure
- lastHeartbeatTime: "2025-02-07T15:46:10Z"
  lastTransitionTime: "2025-02-07T15:51:10Z"
  message: Kubelet stopped posting node status.
  reason: NodeStatusUnknown
  status: Unknown
  type: PIDPressure
- lastHeartbeatTime: "2025-02-07T15:46:10Z"
  lastTransitionTime: "2025-02-07T15:51:10Z"
  message: Kubelet stopped posting node status.
  reason: NodeStatusUnknown
  status: Unknown
  type: Ready

This happens because the node lifecycle controller is also maintaining the Node object.

The Node Lifecycle Controller

The node-lifecycle-controller is relatively easy to follow; its source lives in pkg/controller/nodelifecycle/node_lifecycle_controller.go.

Let's read its source with three questions in mind:

  1. Where do the NoSchedule and NoExecute taints come from?
  2. How does the Ready entry in status.conditions get updated to Unknown?
  3. What actually triggers Pod eviction?

var (
    // UnreachableTaintTemplate is the taint for when a node becomes unreachable.
    UnreachableTaintTemplate = &v1.Taint{
        Key:    v1.TaintNodeUnreachable,
        Effect: v1.TaintEffectNoExecute,
    }
)

func (nc *Controller) doNoExecuteTaintingPass(ctx context.Context) {
        // a lot of code here
        zoneNoExecuteTainterWorker.Try(func(value scheduler.TimedValue) (bool, time.Duration) {
            node, err := nc.nodeLister.Get(value.Value)
            if apierrors.IsNotFound(err) {
                klog.Warningf("Node %v no longer present in nodeLister!", value.Value)
                return true, 0
            } else if err != nil {
                klog.Warningf("Failed to get Node %v from the nodeLister: %v", value.Value, err)
                // retry in 50 millisecond
                return false, 50 * time.Millisecond
            }
            _, condition := controllerutil.GetNodeCondition(&node.Status, v1.NodeReady)
            // Because we want to mimic NodeStatus.Condition["Ready"] we make "unreachable" and "not ready" taints mutually exclusive.
            taintToAdd := v1.Taint{}
            oppositeTaint := v1.Taint{}
            switch condition.Status {
            case v1.ConditionFalse:
                taintToAdd = *NotReadyTaintTemplate
                oppositeTaint = *UnreachableTaintTemplate
            case v1.ConditionUnknown:
                taintToAdd = *UnreachableTaintTemplate
                oppositeTaint = *NotReadyTaintTemplate
            default:
                // It seems that the Node is ready again, so there's no need to taint it.
                klog.V(4).Infof("Node %v was in a taint queue, but it's ready now. Ignoring taint request.", value.Value)
                return true, 0
            }
            result := controllerutil.SwapNodeControllerTaint(ctx, nc.kubeClient, []*v1.Taint{&taintToAdd}, []*v1.Taint{&oppositeTaint}, node)
            if result {
                //count the evictionsNumber
                zone := nodetopology.GetZoneKey(node)
                evictionsNumber.WithLabelValues(zone).Inc()
                evictionsTotal.WithLabelValues(zone).Inc()
            }

            return result, 0
        })
}

When the Ready entry in the conditions list is in the Unknown state, doNoExecuteTaintingPass applies a NoExecute taint with key node.kubernetes.io/unreachable to the node. doNoExecuteTaintingPass itself runs in a dedicated goroutine, looping every 100 ms: https://github.com/kubernetes/kubernetes/blob/1649f592f1909b97aa3c2a0a8f968a3fd05a7b8b/pkg/controller/nodelifecycle/node_lifecycle_controller.go#L565

func (nc *Controller) Run(ctx context.Context) {
    if nc.runTaintManager {
        // Handling taint based evictions. Because we don't want a dedicated logic in TaintManager for NC-originated
        // taints and we normally don't rate limit evictions caused by taints, we need to rate limit adding taints.
        go wait.UntilWithContext(ctx, nc.doNoExecuteTaintingPass, scheduler.NodeEvictionPeriod)
    }
}

Now for the NoSchedule taint: https://github.com/kubernetes/kubernetes/blob/1649f592f1909b97aa3c2a0a8f968a3fd05a7b8b/pkg/controller/nodelifecycle/node_lifecycle_controller.go#L606-L659

var (
    nodeConditionToTaintKeyStatusMap = map[v1.NodeConditionType]map[v1.ConditionStatus]string{
        v1.NodeReady: {
            v1.ConditionFalse:   v1.TaintNodeNotReady,
            v1.ConditionUnknown: v1.TaintNodeUnreachable,
        },
        v1.NodeMemoryPressure: {
            v1.ConditionTrue: v1.TaintNodeMemoryPressure,
        },
        v1.NodeDiskPressure: {
            v1.ConditionTrue: v1.TaintNodeDiskPressure,
        },
        v1.NodeNetworkUnavailable: {
            v1.ConditionTrue: v1.TaintNodeNetworkUnavailable,
        },
        v1.NodePIDPressure: {
            v1.ConditionTrue: v1.TaintNodePIDPressure,
        },
    }
)

func (nc *Controller) doNoScheduleTaintingPass(ctx context.Context, nodeName string) error {
    // a lot of code here
    var taints []v1.Taint
    for _, condition := range node.Status.Conditions {
        if taintMap, found := nodeConditionToTaintKeyStatusMap[condition.Type]; found {
            if taintKey, found := taintMap[condition.Status]; found {
                taints = append(taints, v1.Taint{
                    Key:    taintKey,
                    Effect: v1.TaintEffectNoSchedule,
                })
            }
        }
    }
}

doNoScheduleTaintingPass maps entries in the conditions list to taint keys (a Ready entry in the Unknown state maps to node.kubernetes.io/unreachable) and applies them to the node as NoSchedule taints. Like the previous pass, it also runs in a loop in a separate goroutine.

Both doNoExecuteTaintingPass and doNoScheduleTaintingPass decide which taints to apply based on the entries in the node's conditions list. Those entries, in turn, are flipped to Unknown by tryUpdateNodeHealth:

func (nc *Controller) tryUpdateNodeHealth(ctx context.Context, node *v1.Node) (time.Duration, v1.NodeCondition, *v1.NodeCondition, error) {
    nodeHealth := nc.nodeHealthMap.getDeepCopy(node.Name)
    defer func() {
        nc.nodeHealthMap.set(node.Name, nodeHealth)
    }()

    _, currentReadyCondition := controllerutil.GetNodeCondition(&node.Status, v1.NodeReady)
    // a lot of code here
    if nc.now().After(nodeHealth.probeTimestamp.Add(gracePeriod)) {
        // NodeReady condition or lease was last set longer ago than gracePeriod, so
        // update it to Unknown (regardless of its current value) in the master.

        nodeConditionTypes := []v1.NodeConditionType{
            v1.NodeReady,
            v1.NodeMemoryPressure,
            v1.NodeDiskPressure,
            v1.NodePIDPressure,
            // We don't change 'NodeNetworkUnavailable' condition, as it's managed on a control plane level.
            // v1.NodeNetworkUnavailable,
        }

        nowTimestamp := nc.now()
        for _, nodeConditionType := range nodeConditionTypes {
            _, currentCondition := controllerutil.GetNodeCondition(&node.Status, nodeConditionType)
            if currentCondition == nil {
                // a lot of code here
            } else {
                klog.V(2).Infof("node %v hasn't been updated for %+v. Last %v is: %+v",
                    node.Name, nc.now().Time.Sub(nodeHealth.probeTimestamp.Time), nodeConditionType, currentCondition)
                if currentCondition.Status != v1.ConditionUnknown {
                    currentCondition.Status = v1.ConditionUnknown
                    currentCondition.Reason = "NodeStatusUnknown"
                    currentCondition.Message = "Kubelet stopped posting node status."
                    currentCondition.LastTransitionTime = nowTimestamp
                }
            }
        }
        // We need to update currentReadyCondition due to its value potentially changed.
        _, currentReadyCondition = controllerutil.GetNodeCondition(&node.Status, v1.NodeReady)

        if !apiequality.Semantic.DeepEqual(currentReadyCondition, &observedReadyCondition) {
            if _, err := nc.kubeClient.CoreV1().Nodes().UpdateStatus(ctx, node, metav1.UpdateOptions{}); err != nil {
                klog.Errorf("Error updating node %s: %v", node.Name, err)
                return gracePeriod, observedReadyCondition, currentReadyCondition, err
            }
            nodeHealth = &nodeHealthData{
                status:                   &node.Status,
                probeTimestamp:           nodeHealth.probeTimestamp,
                readyTransitionTimestamp: nc.now(),
                lease:                    observedLease,
            }
            return gracePeriod, observedReadyCondition, currentReadyCondition, nil
        }
    }
}

If the heartbeat has been missing for longer than the grace period (gracePeriod), every condition entry listed above (NetworkUnavailable excluded, since it is managed at the control-plane level) is set to Unknown. The grace period defaults to 40s:

func RecommendedDefaultNodeLifecycleControllerConfiguration(obj *kubectrlmgrconfigv1alpha1.NodeLifecycleControllerConfiguration) {
    // a lot of code here
    if obj.NodeMonitorGracePeriod == zero {
        obj.NodeMonitorGracePeriod = metav1.Duration{Duration: 40 * time.Second}
    }
}

Following the call chain upward, nc.tryUpdateNodeHealth is called from nc.monitorNodeHealth, and monitorNodeHealth likewise loops in its own goroutine.
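
monitorNodeHealth is started with the same wait.UntilWithContext pattern as the tainting passes. A small self-contained sketch of that loop pattern (the body is a stand-in; in kube-controller-manager the period comes from --node-monitor-period, 5s by default):

import (
    "context"
    "time"

    "k8s.io/apimachinery/pkg/util/wait"
    "k8s.io/klog/v2"
)

func runMonitorLoop(ctx context.Context) {
    nodeMonitorPeriod := 5 * time.Second // mirrors the --node-monitor-period default
    // Run the pass, wait the period, repeat until ctx is cancelled --
    // the same shape as `go wait.UntilWithContext(ctx, nc.doNoExecuteTaintingPass, ...)`.
    go wait.UntilWithContext(ctx, func(ctx context.Context) {
        klog.Info("a monitorNodeHealth-style pass would run here")
    }, nodeMonitorPeriod)
}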

Finally, Pod eviction. The doEvictionPass method in node-lifecycle-controller is somewhat misleading: by default it is never executed at all. https://github.com/kubernetes/kubernetes/blob/1649f592f1909b97aa3c2a0a8f968a3fd05a7b8b/pkg/controller/nodelifecycle/node_lifecycle_controller.go#L562-L571

func (nc *Controller) Run(ctx context.Context) {
    // a lot of code here
    if nc.runTaintManager {
        // Handling taint based evictions. Because we don't want a dedicated logic in TaintManager for NC-originated
        // taints and we normally don't rate limit evictions caused by taints, we need to rate limit adding taints.
        go wait.UntilWithContext(ctx, nc.doNoExecuteTaintingPass, scheduler.NodeEvictionPeriod)
    } else {
        // Managing eviction of nodes:
        // When we delete pods off a node, if the node was not empty at the time we then
        // queue an eviction watcher. If we hit an error, retry deletion.
        go wait.UntilWithContext(ctx, nc.doEvictionPass, scheduler.NodeEvictionPeriod)
    }
}

When the TaintManager is enabled (the default), only doNoExecuteTaintingPass loops in a goroutine. The actual Pod eviction is performed by the TaintManager (NoExecuteTaintManager):

func NewNoExecuteTaintManager(ctx context.Context, c clientset.Interface, podLister corelisters.PodLister, nodeLister corelisters.NodeLister, getPodsAssignedToNode GetPodsByNodeNameFunc) *NoExecuteTaintManager {
    eventBroadcaster := record.NewBroadcaster()
    recorder := eventBroadcaster.NewRecorder(scheme.Scheme, v1.EventSource{Component: "taint-controller"})

    tm := &NoExecuteTaintManager{
        client:                c,
        broadcaster:           eventBroadcaster,
        recorder:              recorder,
        podLister:             podLister,
        nodeLister:            nodeLister,
        getPodsAssignedToNode: getPodsAssignedToNode,
        taintedNodes:          make(map[string][]v1.Taint),

        nodeUpdateQueue: workqueue.NewNamed("noexec_taint_node"),
        podUpdateQueue:  workqueue.NewNamed("noexec_taint_pod"),
    }
    tm.taintEvictionQueue = CreateWorkerQueue(deletePodHandler(c, tm.emitPodDeletionEvent))

    return tm
}

Follow the call chain NewNoExecuteTaintManager -> deletePodHandler -> addConditionAndDeletePod:

func addConditionAndDeletePod(ctx context.Context, c clientset.Interface, name, ns string) (err error) {
    // a lot of code here
    return c.CoreV1().Pods(ns).Delete(ctx, name, metav1.DeleteOptions{})
}

Simple and blunt: the original Pod is deleted outright. But Pod eviction often brings its own problems. If the original node is not actually down (only the kubelet failed or the network is partitioned) and the Pod happens to be using a PV with the ReadWriteOnce access mode, the replacement Pod cannot start because it cannot attach the volume, and manual intervention is usually required.
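
One timing detail worth noting: before deleting anything, the TaintManager checks the Pod's tolerations for the NoExecute taints. The DefaultTolerationSeconds admission plugin gives every Pod a 300-second toleration for node.kubernetes.io/not-ready and node.kubernetes.io/unreachable, which is why evictions typically begin about five minutes after a node turns NotReady. A Pod can opt out entirely by tolerating those taints without a time limit; a minimal sketch with the corev1 types:

import corev1 "k8s.io/api/core/v1"

// Tolerations without TolerationSeconds mean "never evict me for these taints";
// DefaultTolerationSeconds would otherwise inject 300-second versions of them.
var stayPutTolerations = []corev1.Toleration{
    {
        Key:      "node.kubernetes.io/unreachable",
        Operator: corev1.TolerationOpExists,
        Effect:   corev1.TaintEffectNoExecute,
    },
    {
        Key:      "node.kubernetes.io/not-ready",
        Operator: corev1.TolerationOpExists,
        Effect:   corev1.TaintEffectNoExecute,
    },
}
// Append these to pod.Spec.Tolerations for workloads that should ride out node outages.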

The Pod Eviction API

When writing our own code against Kubernetes, we can sometimes call the Pod eviction API instead of deleting the Pod directly:

import (
    policyv1 "k8s.io/api/policy/v1"
    ctrlclient "sigs.k8s.io/controller-runtime/pkg/client"
)

// build a controller-runtime client first, e.g.
// c, err := ctrlclient.New(cfg, ctrlclient.Options{})
// a lot of code here

// pod is the corev1.Pod to evict; the call POSTs to the pod's eviction subresource.
err := c.SubResource("eviction").Create(ctx, &pod, &policyv1.Eviction{
    // optionally set ObjectMeta / DeleteOptions here
})

You can discover the underlying HTTP API path yourself by running kubectl drain ${NODE} (add -v=8 to see the requests), so I won't expand on it here. The benefit of going through the eviction API is that the API server mediates the deletion, making the eviction "policy aware" and minimizing the impact on service availability: for example, if the Pod is protected by a PodDisruptionBudget, the eviction may be refused rather than the Pod being deleted immediately.
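
For completeness, the same subresource is also reachable with a plain client-go clientset; a minimal sketch (cs, ns, and podName are placeholders):

import (
    "context"

    policyv1 "k8s.io/api/policy/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

// evictPod POSTs to /api/v1/namespaces/{ns}/pods/{podName}/eviction; the API
// server refuses the request if it would violate a PodDisruptionBudget.
func evictPod(ctx context.Context, cs kubernetes.Interface, ns, podName string) error {
    return cs.PolicyV1().Evictions(ns).Evict(ctx, &policyv1.Eviction{
        ObjectMeta: metav1.ObjectMeta{Name: podName, Namespace: ns},
    })
}

Recent client-go versions also expose the same call directly on the pod client as cs.CoreV1().Pods(ns).EvictV1(...).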