Kubernetes Scheduling and Resource Management

Sep 30, 2019 17:00 · 1089 words · 3 minute read Kubernetes

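Scheduling process

Scheduling places each Pod on a suitable Node:

  • Satisfy the Pod's resource requirements
  • Satisfy the Pod's special relationship requirements
  • Satisfy the Node's constraints
  • Make reasonable use of cluster resources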

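Basic scheduling capabilities

  • Resource scheduling: satisfy the Pod's resource requirements
    • Resource request/limit
      • CPU: 1 = 1000m
      • Memory: 1Gi = 1024Mi
      • Storage
      • GPU
      • FPGA
    • QoS
      • Guaranteed (high)
      • Burstable (medium)
      • BestEffort (low)
    • Resource quota
  • Relationship scheduling: satisfy special relationships and conditions between Pods and Nodes
    • Pod-to-Pod relationships
      • PodAffinity
      • PodAntiAffinity
    • The Pod chooses the Nodes that suit it
      • NodeSelector
      • NodeAffinity
    • Restrict scheduling onto certain Nodes
      • Taint
      • Tolerations

Resource scheduling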

apiVersion: v1
kind: Pod
metadata:
  namespace: demo-ns
  name: demo-pod
spec:
  containers:
  - image: nginx:latest
    name: demo-container
    resources:
      requests:
        cpu: 2
        memory: 1Gi
      limits:
        cpu: 2
        memory: 1Gi

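The QoS class cannot be set manually. Kubernetes derives it from the requests and limits of the Pod's containers.

Guaranteed

Every container must set both CPU and memory, with request == limit.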

apiVersion: v1
kind: Pod
metadata:
  namespace: demo-ns
  name: demo-pod
spec:
  containers:
  - image: nginx:latest
    name: demo-container
    resources:
      requests:
        cpu: 2
        memory: 1Gi
      limits:
        cpu: 2
        memory: 1Gi

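Burstable

At least one container sets a CPU or memory request or limit, but the Pod does not meet the Guaranteed criteria (for example, request and limit differ, or only the request is set).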

apiVersion: v1
kind: Pod
metadata:
  namespace: demo-ns
  name: demo-pod
spec:
  containers:
  - image: nginx:latest
    name: demo-container
    resources:
      requests:
        cpu: 2
        memory: 1Gi

BestEffort

No container sets a request or limit for any resource.

apiVersion: v1
kind: Pod
metadata:
  namespace: demo-ns
  name: demo-pod
spec:
  containers:
  - image: nginx:latest
    name: demo-container

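How the QoS classes differ

  • Scheduling behavior
    • The scheduler uses the request, not the limit, when placing a Pod
  • Runtime behavior
    • CPU shares are weighted by request
    • Memory: the OOM score adjustment (oom_score_adj) is set by QoS class
      • Guaranteed -998 (-997 in newer releases)
      • Burstable 2~999
      • BestEffort 1000
    • Eviction
      • Carried out by the kubelet
      • BestEffort Pods are evicted first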
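
The OOM score ranges above can be illustrated with a short calculation. This is a sketch based on the formula in the Kubernetes documentation, `min(max(2, 1000 - 1000 * memoryRequest / nodeCapacity), 999)` for Burstable Pods; the function and parameter names are assumptions, and the Guaranteed constant follows newer releases (-997), whereas older releases used -998.

```python
GUARANTEED_ADJ = -997   # older releases used -998
BEST_EFFORT_ADJ = 1000


def oom_score_adj(qos: str, memory_request_bytes: int, node_memory_bytes: int) -> int:
    """Approximate the oom_score_adj assigned to a container (sketch)."""
    if qos == "Guaranteed":
        return GUARANTEED_ADJ
    if qos == "BestEffort":
        return BEST_EFFORT_ADJ
    # Burstable: the larger the memory request relative to the node,
    # the lower the score and the less likely the container is to be killed.
    adj = 1000 - (1000 * memory_request_bytes) // node_memory_bytes
    return min(max(adj, 2), 999)


if __name__ == "__main__":
    gib = 2 ** 30
    print(oom_score_adj("Burstable", 1 * gib, 4 * gib))
```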

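Resource quota

A ResourceQuota limits the resource usage of each Namespace. Once the quota would be exceeded, creating new resources in that Namespace is rejected.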

apiVersion: v1
kind: ResourceQuota
metadata:
  name: demo-quota
  namespace: demo-ns
spec:
  hard:
    cpu: 1000
    memory: 200Gi
    pods: 10
  scopeSelector:
    matchExpressions:
    - operator: Exists
      scopeName: NotBestEffort
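
The admission check behind a quota can be sketched as follows. This is only an illustration of the idea (current usage plus the new request must stay within `hard`); the function `admit` and its arguments are hypothetical, and scopes are ignored.

```python
from typing import Dict, List, Tuple


def admit(hard: Dict[str, int], used: Dict[str, int],
          requested: Dict[str, int]) -> Tuple[bool, List[str]]:
    """Return (allowed, exceeded_resource_names) for a new object (sketch)."""
    exceeded = [
        name for name, amount in requested.items()
        if name in hard and used.get(name, 0) + amount > hard[name]
    ]
    return (not exceeded, exceeded)


if __name__ == "__main__":
    hard = {"cpu": 1000, "pods": 10}
    # The Namespace already runs 10 Pods, so an 11th is rejected.
    print(admit(hard, {"cpu": 20, "pods": 10}, {"cpu": 2, "pods": 1}))
```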

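scope:

  • Terminating/NotTerminating

  • BestEffort/NotBestEffort

  • PriorityClass

  • Give every Pod reasonable resource requirements

    • CPU/Mem/EphemeralStorage/GPU
  • Use request and limit to choose a QoS class that fits each workload

    • Guaranteed: sensitive workloads that need guaranteed resources
    • Burstable: less sensitive workloads that need elasticity
    • BestEffort: workloads that tolerate disruption
  • Configure a ResourceQuota for every Namespace to prevent overuse and keep resources available for other users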

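Affinity scheduling

Pod - Pod

  • Pod affinity (PodAffinity)
    • Must be scheduled together with certain Pods

      requiredDuringSchedulingIgnoredDuringExecution

    • Prefer to be scheduled together with certain Pods

      preferredDuringSchedulingIgnoredDuringExecution

  • Pod anti-affinity (PodAntiAffinity)
    • Must not be scheduled together with certain Pods

      requiredDuringSchedulingIgnoredDuringExecution

    • Prefer not to be scheduled together with certain Pods

      preferredDuringSchedulingIgnoredDuringExecution
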
  • operator
    • In
    • NotIn
    • Exists
    • DoesNotExist

apiVersion: v1
kind: Pod
metadata:
  namespace: demo-ns
  name: demo-pod
spec:
  containers:
  - image: nginx:latest
    name: demo-container
  affinity:
    podAffinity:
      # Must land in the same topology domain as Pods labeled k1=v1
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: k1
            operator: In
            values:
            - v1
        topologyKey: "kubernetes.io/hostname"
---
apiVersion: v1
kind: Pod
metadata:
  namespace: demo-ns
  name: demo-pod
spec:
  containers:
  - image: nginx:latest
    name: demo-container
  affinity:
    podAntiAffinity:
      # Must not land on a host that already runs Pods labeled k1=v1
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: k1
            operator: In
            values:
            - v1
        topologyKey: "kubernetes.io/hostname"
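
How the four operators evaluate against a Pod's labels can be sketched like this. The function `matches` is a hypothetical helper for illustration; it mirrors label selector semantics, where all expressions are ANDed and `NotIn` also matches objects that lack the key.

```python
from typing import Dict, List


def matches(labels: Dict[str, str], expressions: List[dict]) -> bool:
    """Evaluate matchExpressions against a label set (sketch)."""
    for expr in expressions:
        key = expr["key"]
        op = expr["operator"]
        values = expr.get("values", [])
        if op == "In":
            ok = key in labels and labels[key] in values
        elif op == "NotIn":
            ok = key not in labels or labels[key] not in values
        elif op == "Exists":
            ok = key in labels
        elif op == "DoesNotExist":
            ok = key not in labels
        else:
            raise ValueError(f"unknown operator: {op}")
        if not ok:
            return False
    return True


if __name__ == "__main__":
    selector = [{"key": "k1", "operator": "In", "values": ["v1"]}]
    print(matches({"k1": "v1"}, selector))
```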

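Pod - Node

  • NodeSelector
    • The Pod must be scheduled onto a Node that carries the given labels
    • Map[string]string
  • NodeAffinity
    • Must be scheduled onto certain Nodes

      requiredDuringSchedulingIgnoredDuringExecution

    • Prefer to be scheduled onto certain Nodes

      preferredDuringSchedulingIgnoredDuringExecution

    • operator: In, NotIn, Exists, DoesNotExist, Gt, Lt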

apiVersion: v1
kind: Pod
metadata:
  namespace: demo-ns
  name: demo-pod
spec:
  containers:
  - image: nginx:latest
    name: demo-container
  nodeSelector:
    k1: v1
---
apiVersion: v1
kind: Pod
metadata:
  namespace: demo-ns
  name: demo-pod
spec:
  containers:
  - image: nginx:latest
    name: demo-container
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: k1
            operator: In
            values:
            - v1

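Node taints and tolerations

  • Taint (Node)
    • A Node can have multiple Taints
    • Effect (what the Taint does)
      • NoSchedule: new Pods are not scheduled onto the Node
      • PreferNoSchedule: the scheduler tries to avoid the Node
      • NoExecute: Pods that do not tolerate the Taint are evicted
  • Toleration (Pod)
    • A Pod can have multiple Tolerations
    • Effect may be empty, which matches all effects
    • operator
      • Exists
      • Equal

apiVersion: v1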
kind: Node
metadata:
  name: demo-node
spec:
  taints:
  - key: k1
    value: v1
    effect: NoSchedule
---
apiVersion: v1
kind: Pod
metadata:
  namespace: demo-ns
  name: demo-pod
spec:
  containers:
  - image: nginx:latest
    name: demo-container
  tolerations:
  - key: k1
    operator: Equal
    value: v1
    effect: NoSchedule
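
The matching rules between a Toleration and a Taint can be sketched as below. The helper names `tolerates` and `can_schedule` are assumptions for illustration; the logic follows the rules listed above (an empty effect matches all effects, `Exists` ignores the value, `Equal` compares it), and `PreferNoSchedule` is treated as a soft preference that never blocks scheduling.

```python
from typing import List


def tolerates(toleration: dict, taint: dict) -> bool:
    """Return True if a single toleration matches a single taint (sketch)."""
    effect = toleration.get("effect", "")
    if effect and effect != taint["effect"]:
        return False
    key = toleration.get("key", "")
    if key and key != taint["key"]:
        return False
    operator = toleration.get("operator", "Equal")  # Equal is the default
    if operator == "Exists":
        return True
    if operator == "Equal":
        return toleration.get("value", "") == taint.get("value", "")
    return False


def can_schedule(tolerations: List[dict], taints: List[dict]) -> bool:
    """Every hard taint on the Node must be tolerated by the Pod."""
    hard = [t for t in taints if t["effect"] in ("NoSchedule", "NoExecute")]
    return all(any(tolerates(tol, t) for tol in tolerations) for t in hard)


if __name__ == "__main__":
    taint = {"key": "k1", "value": "v1", "effect": "NoSchedule"}
    tol = {"key": "k1", "operator": "Equal", "value": "v1", "effect": "NoSchedule"}
    print(can_schedule([tol], [taint]))
```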

Kubernetes advanced scheduling capabilities

  • Priority and preemption scheduling
    • Priority
    • Preemption

Priority scheduling configuration

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high
value: 10000
globalDefault: false
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low
value: 100
globalDefault: false

Priorities:

  • Default priority

    DefaultPriorityWhenNoDefaultClassExists=0

  • Highest priority a user can define

    HighestUserDefinablePriority=1000000000

  • System-level priority

    SystemCriticalPriority=2000000000

  • Built-in system priority classes

    • system-cluster-critical (2000000000)
    • system-node-critical (2000001000)

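Priority scheduling process:

  1. Pod2 and Pod1 enter the scheduling queue in that order, and neither has been scheduled yet
  2. When scheduling runs, the PriorityQueue dequeues the higher-priority Pod1 first and schedules it
  3. After Pod1 is scheduled successfully, Pod2 is scheduled in the next round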
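
The ordering described in these steps can be sketched with a heap. This is a toy model, not the scheduler's actual PriorityQueue: the class name `SchedulingQueue` is hypothetical, and ties are broken by arrival order.

```python
import heapq
import itertools


class SchedulingQueue:
    """Pops the highest-priority Pod first; equal priorities are FIFO (sketch)."""

    def __init__(self) -> None:
        self._heap = []
        self._seq = itertools.count()  # arrival counter used as tie-breaker

    def add(self, name: str, priority: int) -> None:
        # heapq is a min-heap, so the priority is negated.
        heapq.heappush(self._heap, (-priority, next(self._seq), name))

    def pop(self) -> str:
        return heapq.heappop(self._heap)[2]


if __name__ == "__main__":
    queue = SchedulingQueue()
    queue.add("Pod2", 100)     # arrives first, priority class "low"
    queue.add("Pod1", 10000)   # arrives second, priority class "high"
    print(queue.pop(), queue.pop())
```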

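Priority preemption process:

  1. Pod2 is scheduled first and, once scheduled, runs on Node1
  2. Pod1 is scheduled afterwards; scheduling fails because Node1 lacks resources, which triggers the preemption flow
  3. The preemption algorithm selects Pod2 as the victim that must yield to Pod1
  4. Pod2 is evicted from Node1, and Pod1 is scheduled onto Node1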
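
Victim selection for a single Node can be sketched as follows. This is a heavily simplified illustration that only considers CPU on one Node; the function `pick_victims` and the dict fields are assumptions, and the real algorithm also weighs factors such as multiple candidate Nodes and PodDisruptionBudgets.

```python
from typing import List, Optional


def pick_victims(free_cpu: int, running: List[dict],
                 incoming: dict) -> Optional[List[str]]:
    """Pick lower-priority Pods to evict so `incoming` fits (sketch).

    Returns [] if no eviction is needed, or None if preemption cannot help.
    """
    needed = incoming["cpu"] - free_cpu
    if needed <= 0:
        return []
    # Only Pods with strictly lower priority may be preempted; try the lowest first.
    candidates = sorted(
        (p for p in running if p["priority"] < incoming["priority"]),
        key=lambda p: p["priority"],
    )
    victims, freed = [], 0
    for pod in candidates:
        victims.append(pod["name"])
        freed += pod["cpu"]
        if freed >= needed:
            return victims
    return None


if __name__ == "__main__":
    node1 = [{"name": "Pod2", "priority": 100, "cpu": 2}]
    pod1 = {"name": "Pod1", "priority": 10000, "cpu": 2}
    print(pick_victims(0, node1, pod1))
```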