Kubernetes Node Lifecycle Controller
Feb 9, 2025 00:00 · 2796 words · 6 minute read
The OpenYurt project provides edge autonomy: when the cloud-edge network is disconnected and an edge node or its workload containers restart, the containers can recover automatically on the edge node instead of being evicted and rescheduled by the cloud-side kube-controller-manager. Enabling this feature requires disabling the node-lifecycle-controller in kube-controller-manager (by adding --controllers=-node-lifecycle-controller to its startup arguments) and enabling the node-lifecycle-controller implemented in YurtManager, which carries the business logic for edge autonomy.
kube-controller-manager contains many controllers; among them, node-lifecycle-controller monitors node health and evicts Pods when a node becomes unhealthy.
Node Heartbeat
In distributed systems, liveness is usually signaled by periodic "heartbeats". Heartbeats can be implemented in various ways, for example by periodically renewing or acquiring a lease. In Kubernetes, the node (Node) heartbeat is reported by the kubelet, which periodically updates the Ready entry in the conditions list of the Node status.
$ k get node vtester0 -o json | jq '.status.conditions[] | select(.type == "Ready")'
{
  "lastHeartbeatTime": "2025-02-02T15:32:41Z",
  "lastTransitionTime": "2024-12-31T08:54:57Z",
  "message": "kubelet is posting ready status",
  "reason": "KubeletReady",
  "status": "True",
  "type": "Ready"
}
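For reference, the same condition can also be read programmatically. Below is a minimal client-go sketch; the kubeconfig path and node name are placeholders, not part of the original setup:

package main

import (
    "context"
    "fmt"

    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    // Build a client from a local kubeconfig (path is a placeholder).
    cfg, err := clientcmd.BuildConfigFromFlags("", "/root/.kube/config")
    if err != nil {
        panic(err)
    }
    clientset := kubernetes.NewForConfigOrDie(cfg)

    // Fetch the Node and print its Ready condition, mirroring the jq filter above.
    node, err := clientset.CoreV1().Nodes().Get(context.TODO(), "vtester0", metav1.GetOptions{})
    if err != nil {
        panic(err)
    }
    for _, cond := range node.Status.Conditions {
        if cond.Type == corev1.NodeReady {
            fmt.Printf("Ready=%s reason=%s lastHeartbeatTime=%s\n", cond.Status, cond.Reason, cond.LastHeartbeatTime)
        }
    }
}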
The kubelet reports node status every 5 minutes by default; this can be customized via the nodeStatusUpdateFrequency / nodeStatusReportFrequency fields in /var/lib/kubelet/config.yaml. As the defaulting code below shows, nodeStatusReportFrequency defaults to 5 minutes, or to nodeStatusUpdateFrequency when the latter is set explicitly. The defaulting logic is nested fairly deep; read that part of the source for the full picture.
func SetDefaults_KubeletConfiguration(obj *kubeletconfigv1beta1.KubeletConfiguration) {
    // a lot of code here
    if obj.NodeStatusReportFrequency == zeroDuration {
        // For backward compatibility, NodeStatusReportFrequency's default value is
        // set to NodeStatusUpdateFrequency if NodeStatusUpdateFrequency is set
        // explicitly.
        if obj.NodeStatusUpdateFrequency == zeroDuration {
            obj.NodeStatusReportFrequency = metav1.Duration{Duration: 5 * time.Minute}
        } else {
            obj.NodeStatusReportFrequency = obj.NodeStatusUpdateFrequency
        }
    }
}
When the kubelet dies for whatever reason and can no longer refresh the Ready entry in the conditions list in time, kubectl get node shows the node as NotReady. At that point we can see that the node has been given two taints, and that every entry in the status conditions list (except NetworkUnavailable) has been updated with reason NodeStatusUnknown.
$ kubectl get node vtester1 -w
NAME       STATUS     ROLES    AGE   VERSION
vtester1   Ready      <none>   38d   v1.26.10
vtester1   NotReady   <none>   38d   v1.26.10
vtester1   NotReady   <none>   38d   v1.26.10
vtester1   NotReady   <none>   38d   v1.26.10
$ kubectl get node vtester1 -o yaml | yq .spec.taints
- effect: NoSchedule
  key: node.kubernetes.io/unreachable
  timeAdded: "2025-02-07T15:44:35Z"
- effect: NoExecute
  key: node.kubernetes.io/unreachable
  timeAdded: "2025-02-07T15:44:40Z"
$ kubectl get node vtester1 -o yaml | yq .status.conditions
- lastHeartbeatTime: "2025-02-05T03:55:06Z"
  lastTransitionTime: "2025-02-05T03:55:06Z"
  message: Flannel is running on this node
  reason: FlannelIsUp
  status: "False"
  type: NetworkUnavailable
- lastHeartbeatTime: "2025-02-07T15:46:10Z"
  lastTransitionTime: "2025-02-07T15:51:10Z"
  message: Kubelet stopped posting node status.
  reason: NodeStatusUnknown
  status: Unknown
  type: MemoryPressure
- lastHeartbeatTime: "2025-02-07T15:46:10Z"
  lastTransitionTime: "2025-02-07T15:51:10Z"
  message: Kubelet stopped posting node status.
  reason: NodeStatusUnknown
  status: Unknown
  type: DiskPressure
- lastHeartbeatTime: "2025-02-07T15:46:10Z"
  lastTransitionTime: "2025-02-07T15:51:10Z"
  message: Kubelet stopped posting node status.
  reason: NodeStatusUnknown
  status: Unknown
  type: PIDPressure
- lastHeartbeatTime: "2025-02-07T15:46:10Z"
  lastTransitionTime: "2025-02-07T15:51:10Z"
  message: Kubelet stopped posting node status.
  reason: NodeStatusUnknown
  status: Unknown
  type: Ready
This happens because the node lifecycle controller is also maintaining the Node object.
Node Lifecycle Controller
The node lifecycle controller (node-lifecycle-controller) is relatively easy to follow; its source lives in pkg/controller/nodelifecycle/node_lifecycle_controller.go.
We will read the source with three questions in mind:
- Where do the NoSchedule and NoExecute taints come from?
- How does the Ready entry in the status conditions list get updated to Unknown?
- What triggers Pod eviction?
var (
    // UnreachableTaintTemplate is the taint for when a node becomes unreachable.
    UnreachableTaintTemplate = &v1.Taint{
        Key:    v1.TaintNodeUnreachable,
        Effect: v1.TaintEffectNoExecute,
    }
)
func (nc *Controller) doNoExecuteTaintingPass(ctx context.Context) {
    // a lot of code here
    zoneNoExecuteTainterWorker.Try(func(value scheduler.TimedValue) (bool, time.Duration) {
        node, err := nc.nodeLister.Get(value.Value)
        if apierrors.IsNotFound(err) {
            klog.Warningf("Node %v no longer present in nodeLister!", value.Value)
            return true, 0
        } else if err != nil {
            klog.Warningf("Failed to get Node %v from the nodeLister: %v", value.Value, err)
            // retry in 50 millisecond
            return false, 50 * time.Millisecond
        }
        _, condition := controllerutil.GetNodeCondition(&node.Status, v1.NodeReady)
        // Because we want to mimic NodeStatus.Condition["Ready"] we make "unreachable" and "not ready" taints mutually exclusive.
        taintToAdd := v1.Taint{}
        oppositeTaint := v1.Taint{}
        switch condition.Status {
        case v1.ConditionFalse:
            taintToAdd = *NotReadyTaintTemplate
            oppositeTaint = *UnreachableTaintTemplate
        case v1.ConditionUnknown:
            taintToAdd = *UnreachableTaintTemplate
            oppositeTaint = *NotReadyTaintTemplate
        default:
            // It seems that the Node is ready again, so there's no need to taint it.
            klog.V(4).Infof("Node %v was in a taint queue, but it's ready now. Ignoring taint request.", value.Value)
            return true, 0
        }
        result := controllerutil.SwapNodeControllerTaint(ctx, nc.kubeClient, []*v1.Taint{&taintToAdd}, []*v1.Taint{&oppositeTaint}, node)
        if result {
            // count the evictionsNumber
            zone := nodetopology.GetZoneKey(node)
            evictionsNumber.WithLabelValues(zone).Inc()
            evictionsTotal.WithLabelValues(zone).Inc()
        }
        return result, 0
    })
}
When the Ready entry in the conditions list is in the Unknown state, the doNoExecuteTaintingPass method taints the node with a NoExecute taint whose key is node.kubernetes.io/unreachable. doNoExecuteTaintingPass runs in its own goroutine, looping once every 100 ms (scheduler.NodeEvictionPeriod) https://github.com/kubernetes/kubernetes/blob/1649f592f1909b97aa3c2a0a8f968a3fd05a7b8b/pkg/controller/nodelifecycle/node_lifecycle_controller.go#L565:
func (nc *Controller) Run(ctx context.Context) {
    if nc.runTaintManager {
        // Handling taint based evictions. Because we don't want a dedicated logic in TaintManager for NC-originated
        // taints and we normally don't rate limit evictions caused by taints, we need to rate limit adding taints.
        go wait.UntilWithContext(ctx, nc.doNoExecuteTaintingPass, scheduler.NodeEvictionPeriod)
    }
}
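How long a Pod survives on such a node before being evicted is governed by its tolerations: unless the Pod declares one itself, the DefaultTolerationSeconds admission plugin gives it a 300-second toleration for this taint. Below is a minimal sketch of declaring the toleration explicitly; the helper name and the 600-second value are made up for illustration:

import (
    corev1 "k8s.io/api/core/v1"
)

// tolerateUnreachableFor builds a toleration that lets a Pod keep running for
// the given number of seconds after the node.kubernetes.io/unreachable:NoExecute
// taint is added, before the TaintManager deletes it.
func tolerateUnreachableFor(seconds int64) corev1.Toleration {
    return corev1.Toleration{
        Key:               corev1.TaintNodeUnreachable, // "node.kubernetes.io/unreachable"
        Operator:          corev1.TolerationOpExists,
        Effect:            corev1.TaintEffectNoExecute,
        TolerationSeconds: &seconds,
    }
}

// Usage: pod.Spec.Tolerations = append(pod.Spec.Tolerations, tolerateUnreachableFor(600))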
Next, the NoSchedule taint https://github.com/kubernetes/kubernetes/blob/1649f592f1909b97aa3c2a0a8f968a3fd05a7b8b/pkg/controller/nodelifecycle/node_lifecycle_controller.go#L606-L659:
var (
    nodeConditionToTaintKeyStatusMap = map[v1.NodeConditionType]map[v1.ConditionStatus]string{
        v1.NodeReady: {
            v1.ConditionFalse:   v1.TaintNodeNotReady,
            v1.ConditionUnknown: v1.TaintNodeUnreachable,
        },
        v1.NodeMemoryPressure: {
            v1.ConditionTrue: v1.TaintNodeMemoryPressure,
        },
        v1.NodeDiskPressure: {
            v1.ConditionTrue: v1.TaintNodeDiskPressure,
        },
        v1.NodeNetworkUnavailable: {
            v1.ConditionTrue: v1.TaintNodeNetworkUnavailable,
        },
        v1.NodePIDPressure: {
            v1.ConditionTrue: v1.TaintNodePIDPressure,
        },
    }
)
func (nc *Controller) doNoScheduleTaintingPass(ctx context.Context, nodeName string) error {
    // a lot of code here
    var taints []v1.Taint
    for _, condition := range node.Status.Conditions {
        if taintMap, found := nodeConditionToTaintKeyStatusMap[condition.Type]; found {
            if taintKey, found := taintMap[condition.Status]; found {
                taints = append(taints, v1.Taint{
                    Key:    taintKey,
                    Effect: v1.TaintEffectNoSchedule,
                })
            }
        }
    }
}
In doNoScheduleTaintingPass, the Ready entry in the conditions list (with status Unknown) is mapped to the node.kubernetes.io/unreachable key, and the node is given a NoSchedule taint. Like doNoExecuteTaintingPass, doNoScheduleTaintingPass also loops in its own goroutine.
Both doNoExecuteTaintingPass and doNoScheduleTaintingPass decide which taints to apply based on the entries in the node's conditions list, so the next question is how those entries get updated to Unknown:
func (nc *Controller) tryUpdateNodeHealth(ctx context.Context, node *v1.Node) (time.Duration, v1.NodeCondition, *v1.NodeCondition, error) {
    nodeHealth := nc.nodeHealthMap.getDeepCopy(node.Name)
    defer func() {
        nc.nodeHealthMap.set(node.Name, nodeHealth)
    }()
    _, currentReadyCondition := controllerutil.GetNodeCondition(&node.Status, v1.NodeReady)
    // a lot of code here
    if nc.now().After(nodeHealth.probeTimestamp.Add(gracePeriod)) {
        // NodeReady condition or lease was last set longer ago than gracePeriod, so
        // update it to Unknown (regardless of its current value) in the master.
        nodeConditionTypes := []v1.NodeConditionType{
            v1.NodeReady,
            v1.NodeMemoryPressure,
            v1.NodeDiskPressure,
            v1.NodePIDPressure,
            // We don't change 'NodeNetworkUnavailable' condition, as it's managed on a control plane level.
            // v1.NodeNetworkUnavailable,
        }
        nowTimestamp := nc.now()
        for _, nodeConditionType := range nodeConditionTypes {
            _, currentCondition := controllerutil.GetNodeCondition(&node.Status, nodeConditionType)
            if currentCondition == nil {
                // a lot of code here
            } else {
                klog.V(2).Infof("node %v hasn't been updated for %+v. Last %v is: %+v",
                    node.Name, nc.now().Time.Sub(nodeHealth.probeTimestamp.Time), nodeConditionType, currentCondition)
                if currentCondition.Status != v1.ConditionUnknown {
                    currentCondition.Status = v1.ConditionUnknown
                    currentCondition.Reason = "NodeStatusUnknown"
                    currentCondition.Message = "Kubelet stopped posting node status."
                    currentCondition.LastTransitionTime = nowTimestamp
                }
            }
        }
        // We need to update currentReadyCondition due to its value potentially changed.
        _, currentReadyCondition = controllerutil.GetNodeCondition(&node.Status, v1.NodeReady)
        if !apiequality.Semantic.DeepEqual(currentReadyCondition, &observedReadyCondition) {
            if _, err := nc.kubeClient.CoreV1().Nodes().UpdateStatus(ctx, node, metav1.UpdateOptions{}); err != nil {
                klog.Errorf("Error updating node %s: %v", node.Name, err)
                return gracePeriod, observedReadyCondition, currentReadyCondition, err
            }
            nodeHealth = &nodeHealthData{
                status:                   &node.Status,
                probeTimestamp:           nodeHealth.probeTimestamp,
                readyTransitionTimestamp: nc.now(),
                lease:                    observedLease,
            }
            return gracePeriod, observedReadyCondition, currentReadyCondition, nil
        }
    }
}
If the heartbeat has timed out (the last probe is older than gracePeriod), the controller flips every condition entry it manages to Unknown (NodeNetworkUnavailable is excluded, as it is managed at the control-plane level). Note that the probe timestamp is refreshed not only by node status updates but also by Lease renewals (see the "NodeReady condition or lease" comment above), which is why a healthy node is not marked Unknown even though status reports only arrive every 5 minutes. The grace period defaults to 40s:
func RecommendedDefaultNodeLifecycleControllerConfiguration(obj *kubectrlmgrconfigv1alpha1.NodeLifecycleControllerConfiguration) {
    // a lot of code here
    if obj.NodeMonitorGracePeriod == zero {
        obj.NodeMonitorGracePeriod = metav1.Duration{Duration: 40 * time.Second}
    }
}
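Putting these defaults together gives a rough end-to-end timeline. The back-of-the-envelope sketch below assumes the 40-second grace period above plus the 300-second default toleration mentioned earlier, and ignores the controllers' loop intervals:

package main

import (
    "fmt"
    "time"
)

func main() {
    // kube-controller-manager --node-monitor-grace-period default.
    nodeMonitorGracePeriod := 40 * time.Second
    // DefaultTolerationSeconds admission plugin default for the
    // node.kubernetes.io/unreachable:NoExecute toleration.
    defaultToleration := 300 * time.Second

    // From the last heartbeat: conditions flip to Unknown after the grace
    // period, the NoExecute taint follows almost immediately, and the
    // TaintManager deletes the Pod once the toleration window expires.
    fmt.Println("last heartbeat -> Pod deletion ≈", nodeMonitorGracePeriod+defaultToleration) // ≈ 5m40s
}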
Following the call chain nc.tryUpdateNodeHealth <- nc.monitorNodeHealth, the monitorNodeHealth method likewise loops in its own goroutine.
Finally, Pod eviction. The doEvictionPass method in node-lifecycle-controller is a bit misleading: with the default configuration it is never executed at all https://github.com/kubernetes/kubernetes/blob/1649f592f1909b97aa3c2a0a8f968a3fd05a7b8b/pkg/controller/nodelifecycle/node_lifecycle_controller.go#L562-L571:
func (nc *Controller) Run(ctx context.Context) {
    // a lot of code here
    if nc.runTaintManager {
        // Handling taint based evictions. Because we don't want a dedicated logic in TaintManager for NC-originated
        // taints and we normally don't rate limit evictions caused by taints, we need to rate limit adding taints.
        go wait.UntilWithContext(ctx, nc.doNoExecuteTaintingPass, scheduler.NodeEvictionPeriod)
    } else {
        // Managing eviction of nodes:
        // When we delete pods off a node, if the node was not empty at the time we then
        // queue an eviction watcher. If we hit an error, retry deletion.
        go wait.UntilWithContext(ctx, nc.doEvictionPass, scheduler.NodeEvictionPeriod)
    }
}
When the TaintManager is enabled (the default), only doNoExecuteTaintingPass loops in a goroutine. The actual Pod eviction is implemented by the TaintManager (NoExecuteTaintManager):
func NewNoExecuteTaintManager(ctx context.Context, c clientset.Interface, podLister corelisters.PodLister, nodeLister corelisters.NodeLister, getPodsAssignedToNode GetPodsByNodeNameFunc) *NoExecuteTaintManager {
    eventBroadcaster := record.NewBroadcaster()
    recorder := eventBroadcaster.NewRecorder(scheme.Scheme, v1.EventSource{Component: "taint-controller"})
    tm := &NoExecuteTaintManager{
        client:                c,
        broadcaster:           eventBroadcaster,
        recorder:              recorder,
        podLister:             podLister,
        nodeLister:            nodeLister,
        getPodsAssignedToNode: getPodsAssignedToNode,
        taintedNodes:          make(map[string][]v1.Taint),
        nodeUpdateQueue:       workqueue.NewNamed("noexec_taint_node"),
        podUpdateQueue:        workqueue.NewNamed("noexec_taint_pod"),
    }
    tm.taintEvictionQueue = CreateWorkerQueue(deletePodHandler(c, tm.emitPodDeletionEvent))
    return tm
}
Follow the call chain NewNoExecuteTaintManager -> deletePodHandler -> addConditionAndDeletePod:
func addConditionAndDeletePod(ctx context.Context, c clientset.Interface, name, ns string) (err error) {
    // a lot of code here
    return c.CoreV1().Pods(ns).Delete(ctx, name, metav1.DeleteOptions{})
}
Blunt and simple: the original Pod is deleted outright. Pod eviction, however, often triggers a cascade of follow-up problems. For example, if the node is not actually down but merely has a crashed kubelet or a network outage, and the evicted Pod happens to use a PV with the ReadWriteOnce access mode, the replacement Pod cannot start properly (it cannot mount the PV), and manual operator intervention is usually required.
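When that happens, it can help to check which volumes the cluster still considers attached to the failed node before intervening. A minimal client-go sketch (the function name is mine; the clientset is assumed to be constructed elsewhere):

import (
    "context"
    "fmt"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

// listAttachmentsOnNode prints the VolumeAttachments still bound to the given
// node; these are usually what blocks the replacement Pod from mounting an
// RWO volume on another node.
func listAttachmentsOnNode(ctx context.Context, clientset kubernetes.Interface, nodeName string) error {
    attachments, err := clientset.StorageV1().VolumeAttachments().List(ctx, metav1.ListOptions{})
    if err != nil {
        return err
    }
    for _, va := range attachments.Items {
        if va.Spec.NodeName == nodeName && va.Spec.Source.PersistentVolumeName != nil {
            fmt.Printf("%s: PV %s attached=%v\n", va.Name, *va.Spec.Source.PersistentVolumeName, va.Status.Attached)
        }
    }
    return nil
}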
Pod Eviction API
When we write our own code against Kubernetes, we can sometimes call the Pod eviction API instead of deleting the Pod directly:
import (
    policyv1 "k8s.io/api/policy/v1"
    "sigs.k8s.io/controller-runtime/pkg/client"
)

// Create a controller-runtime client instance first, e.g. with client.New()
// or from a manager; it is named c here to avoid shadowing the client package.
// var c client.Client
// a lot of code here
if err := c.SubResource("eviction").Create(ctx, &pod, &policyv1.Eviction{
    // custom options, e.g. DeleteOptions
}); err != nil {
    // handle the error (a 429 here typically means a PodDisruptionBudget blocked the eviction)
}
You can discover the HTTP API path yourself by running kubectl drain ${NODE} with increased verbosity (-v); I won't expand on it here. The benefit of this approach is that deletion becomes policy-aware: the eviction subresource lets the control plane "strategically" refuse or delay the removal so that service availability is affected as little as possible. For example, if the Pod is covered by a PodDisruptionBudget, the eviction may not be allowed immediately.
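To illustrate that interplay, here is a minimal PodDisruptionBudget sketch (the name, namespace, label selector, and replica count are made up); eviction requests that would drop the number of available Pods below minAvailable are rejected until other replicas become ready again:

import (
    policyv1 "k8s.io/api/policy/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/util/intstr"
)

// examplePDB keeps at least 2 Pods labelled app=my-service available;
// the eviction API refuses requests that would violate this budget.
func examplePDB() *policyv1.PodDisruptionBudget {
    minAvailable := intstr.FromInt(2)
    return &policyv1.PodDisruptionBudget{
        ObjectMeta: metav1.ObjectMeta{Name: "my-service-pdb", Namespace: "default"},
        Spec: policyv1.PodDisruptionBudgetSpec{
            MinAvailable: &minAvailable,
            Selector: &metav1.LabelSelector{
                MatchLabels: map[string]string{"app": "my-service"},
            },
        },
    }
}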