ReadWriteOnce PVs and Pod Migration


A Pod that uses a Ceph RBD PV with the ReadWriteOnce access mode fails to start after being "migrated" to another node: it gets stuck in the ContainerCreating state.

The PVC and Deployment are defined as follows:

---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-test
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.14.2
        ports:
        - containerPort: 80
        volumeMounts:
        - mountPath: /var/log/nginx
          name: log-vol
      volumes:
      - name: log-vol
        persistentVolumeClaim:
          claimName: pvc-test

When the node hosting the nginx Pod becomes unhealthy (kubelet dies):

$ kubectl get nodes
NAME    STATUS     ROLES                  AGE    VERSION
mec51   Ready      control-plane,worker   103d   v1.27.2
mec52   NotReady   control-plane,worker   103d   v1.27.2
mec53   Ready      control-plane,worker   103d   v1.27.2

$ kubectl get po -o wide
NAME                     READY   STATUS              RESTARTS   AGE     IP           NODE    NOMINATED NODE   READINESS GATES
nginx-5ccff8b49c-6w5p7   0/1     ContainerCreating   0          23s     <none>       mec51   <none>           <none>
nginx-5ccff8b49c-pc2z4   1/1     Terminating         0          7m15s   172.10.0.2   mec52   <none>           <none>

Because the PVC's access mode is ReadWriteOnce, the attachdetach-controller in kube-controller-manager will not attach the PV to a new node until it has been detached from the original node.

The binding between a PV and a node is recorded in a VolumeAttachment object; delete the relevant VolumeAttachment:

$ kubectl get volumeattachments | grep pvc-d04a31e5-bdcc-46ad-86c6-7e4bf91a8c93
csi-e94a3f5697b001e59f9f61f17f55d76d34b0e63468e00bc75571d5cbf3d2c79c   rook-ceph.rbd.csi.ceph.com      pvc-d04a31e5-bdcc-46ad-86c6-7e4bf91a8c93   mec52   true       9m41s
$ kubectl delete volumeattachments csi-e94a3f5697b001e59f9f61f17f55d76d34b0e63468e00bc75571d5cbf3d2c79c

Although the VolumeAttachment is recreated by kube-controller-manager and bound to node mec51 following the new Pod nginx-5ccff8b49c-6w5p7, the Ceph RBD CSI node plugin still refuses to attach the RBD image backing the PV to the new node:

# node mec51
$ journalctl -u kubelet -f
Feb 01 14:26:24 mec51 kubelet[4010]: E0201 14:26:24.206515    4010 nestedpendingoperations.go:348] Operation for "{volumeName:kubernetes.io/csi/rook-ceph.rbd.csi.ceph.com^0001-0012-rook-ceph-external-0000000000000002-10ff5cc1-c00b-11ee-bb20-fa163e1aa998 podName: nodeName:}" failed. No retries permitted until 2024-02-01 14:26:40.206480124 +0800 CST m=+7759.293368765 (durationBeforeRetry 16s). Error: MountVolume.MountDevice failed for volume "pvc-d04a31e5-bdcc-46ad-86c6-7e4bf91a8c93" (UniqueName: "kubernetes.io/csi/rook-ceph.rbd.csi.ceph.com^0001-0012-rook-ceph-external-0000000000000002-10ff5cc1-c00b-11ee-bb20-fa163e1aa998") pod "nginx-5ccff8b49c-6w5p7" (UID: "e85b2bad-d4d4-4739-b187-25d0a11008d3") : rpc error: code = Internal desc = rbd image mec-ecs-pool/csi-vol-10ff5cc1-c00b-11ee-bb20-fa163e1aa998 is still being used

This is kubelet's NodeStageVolume RPC call to the CSI node plugin failing.
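
For reference, below is a rough sketch of that gRPC call using the CSI spec's Go bindings. The socket path and staging path are illustrative, the volume handle is taken from the log above, and the real request also carries Secrets and VolumeContext (pool and image name) that are omitted here.

package main

import (
    "context"
    "log"

    "github.com/container-storage-interface/spec/lib/go/csi"
    "google.golang.org/grpc"
    "google.golang.org/grpc/credentials/insecure"
)

func main() {
    // kubelet reaches the node plugin through the unix socket the driver registered.
    conn, err := grpc.Dial(
        "unix:///var/lib/kubelet/plugins/rook-ceph.rbd.csi.ceph.com/csi.sock",
        grpc.WithTransportCredentials(insecure.NewCredentials()),
    )
    if err != nil {
        log.Fatal(err)
    }
    defer conn.Close()

    _, err = csi.NewNodeClient(conn).NodeStageVolume(context.TODO(), &csi.NodeStageVolumeRequest{
        // CSI volume handle, taken from the kubelet log above.
        VolumeId: "0001-0012-rook-ceph-external-0000000000000002-10ff5cc1-c00b-11ee-bb20-fa163e1aa998",
        // Illustrative staging path; kubelet derives the real one from the PV.
        StagingTargetPath: "/var/lib/kubelet/plugins/kubernetes.io/csi/<driver>/<hash>/globalmount",
        VolumeCapability: &csi.VolumeCapability{
            AccessType: &csi.VolumeCapability_Mount{
                Mount: &csi.VolumeCapability_MountVolume{FsType: "ext4"},
            },
            // ReadWriteOnce maps to SINGLE_NODE_WRITER.
            AccessMode: &csi.VolumeCapability_AccessMode{
                Mode: csi.VolumeCapability_AccessMode_SINGLE_NODE_WRITER,
            },
        },
        // The real request also carries Secrets and VolumeContext, which ceph-csi
        // needs to locate the pool and image; they are left out of this sketch.
    })
    // While the stale watcher exists, this returns codes.Internal:
    // "rbd image ... is still being used".
    log.Println(err)
}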

Looking at the source of the Ceph RBD CSI node plugin:

The call chain is NodeStageVolume -> stageTransaction -> attachRBDImage -> waitForrbdImage:

func waitForrbdImage(ctx context.Context, backoff wait.Backoff, volOptions *rbdVolume) error {
    imagePath := volOptions.String()

    err := wait.ExponentialBackoff(backoff, func() (bool, error) {
        used, err := volOptions.isInUse()
        if err != nil {
            return false, fmt.Errorf("fail to check rbd image status: (%w)", err)
        }
        if (volOptions.DisableInUseChecks) && (used) {
            log.UsefulLog(ctx, "valid multi-node attach requested, ignoring watcher in-use result")

            return used, nil
        }

        return !used, nil
    })
    // return error if rbd image has not become available for the specified timeout
    if errors.Is(err, wait.ErrWaitTimeout) {
        return fmt.Errorf("rbd image %s is still being used", imagePath)
    }
    // return error if any other errors were encountered during waiting for the image to become available
    return err
}

The Ceph RBD CSI node plugin first checks whether the RBD image is in use:

https://github.com/ceph/ceph-csi/blob/a67bf8928cf03a70ae885a66780f052bef6956de/internal/rbd/rbd_util.go#L511-L544

// isInUse checks if there is a watcher on the image. It returns true if there
// is a watcher on the image, otherwise returns false.
func (ri *rbdImage) isInUse() (bool, error) {
    image, err := ri.open()
    if err != nil {
        if errors.Is(err, ErrImageNotFound) || errors.Is(err, util.ErrPoolNotFound) {
            return false, err
        }
        // any error should assume something else is using the image
        return true, err
    }
    defer image.Close()

    watchers, err := image.ListWatchers()
    if err != nil {
        return false, err
    }

    mirrorInfo, err := image.GetMirrorImageInfo()
    if err != nil {
        return false, err
    }
    ri.Primary = mirrorInfo.Primary

    // because we opened the image, there is at least one watcher
    defaultWatchers := 1
    if ri.Primary {
        // if rbd mirror daemon is running, a watcher will be added by the rbd
        // mirror daemon for mirrored images.
        defaultWatchers++
    }

    return len(watchers) > defaultWatchers, nil
}

It uses the number of watchers currently on the RBD image to decide whether the image is in use (i.e. attached to a node).

$ rbd status mec-ecs-pool/csi-vol-10ff5cc1-c00b-11ee-bb20-fa163e1aa998
Watchers:
        watcher=192.168.73.52:0/319745466 client.2474196 cookie=18446462598732840999

192.168.73.52 is the IP of node mec52. That node is not actually completely down; only kubelet has stopped, so the RBD client on it still holds a watch on the image.

Manually add node mec52 to the Ceph OSD blocklist:

$ ceph osd blacklist add 192.168.73.52
blocklisting 192.168.73.52:0/319745466 until 2024-02-01T07:57:05.125665+0000 (3600 sec)

$ rbd status mec-ecs-pool/csi-vol-10ff5cc1-c00b-11ee-bb20-fa163e1aa998
Watchers: none

After a short while the node plugin successfully attaches the RBD image to node mec51, and the new Pod transitions to Running:

$ kubectl get po
NAME                     READY   STATUS    RESTARTS   AGE
nginx-5ccff8b49c-6w5p7   1/1     Running   0          45m

Then clear the OSD blocklist:

$ ceph osd blacklist clear

Next, let's look at how to implement the blocklisting operation above with CSI-Addons and NetworkFence.

CSI-Addons

First, confirm that your CSI driver implements the CSI-Addons specification.

As the name suggests, CSI-Addons extends and enhances the existing capabilities of CSI.

.------.   CR  .------------.
| User |-------| CSI-Addons |
'------'       | Controller |
               '------------'
                      |
                      | gRPC
                      |
            .---------+------------------------------.
            |         |                              |
            |  .------------.        .------------.  |
            |  | CSI-Addons |  gRPC  |    CSI     |  |
            |  |  sidecar   |--------| Controller |  |
            |  '------------'        | NodePlugin |  |
            |                        '------------'  |
            | CSI-driver Pod                         |
            '----------------------------------------'
  1. Just like the official Kubernetes CSI sidecar containers, CSI-Addons needs an additional sidecar container deployed inside the CSI driver Pods:

    # the csi-rbdplugin-provisioner Pods previously had 5 containers each
    $ kubectl get po -n rook-ceph -l "app=csi-rbdplugin-provisioner"
    NAME                                         READY   STATUS    RESTARTS   AGE
    csi-rbdplugin-provisioner-77fbb96487-hg8jz   6/6     Running   0          2d19h
    csi-rbdplugin-provisioner-77fbb96487-w2zpj   6/6     Running   0          2d19h
    # the csi-rbdplugin Pods previously had 2 containers each
    $ kubectl get po -n rook-ceph -l "app=csi-rbdplugin"
    NAME                  READY   STATUS    RESTARTS   AGE
    csi-rbdplugin-62h5f   3/3     Running   0          2d19h
    csi-rbdplugin-qb7g9   3/3     Running   0          2d19h
    csi-rbdplugin-tsphv   3/3     Running   0          2d19h
    

    If the Ceph CSI driver is deployed with Rook, set CSI_ENABLE_CSIADDONS: true in the rook-ceph-operator configuration, or enable it at Helm install time.

  2. The CSI-Addons controller itself still needs to be deployed separately:

    $ kubectl apply -f https://raw.githubusercontent.com/csi-addons/kubernetes-csi-addons/v0.7.0/deploy/controller/crds.yaml
    $ kubectl apply -f https://raw.githubusercontent.com/csi-addons/kubernetes-csi-addons/v0.7.0/deploy/controller/rbac.yaml
    $ curl -s https://raw.githubusercontent.com/csi-addons/kubernetes-csi-addons/v0.7.0/deploy/controller/setup-controller.yaml | sed 's/k8s-controller:latest/k8s-controller:v0.7.0/g' | kubectl create -f - # note: pin the controller image to v0.7.0
    

CSIAddonsNode objects are created automatically by the CSI-Addons sidecar in csi-rbdplugin; there is no need to create them by hand:

$ kubectl get CSIAddonsNode -A
NAMESPACE   NAME                                         NAMESPACE   AGE   DRIVERNAME                   ENDPOINT           NODEID
rook-ceph   csi-rbdplugin-62h5f                          rook-ceph   32m   rook-ceph.rbd.csi.ceph.com   172.10.0.73:9070   mec52
rook-ceph   csi-rbdplugin-provisioner-77fbb96487-hg8jz   rook-ceph   32m   rook-ceph.rbd.csi.ceph.com   172.10.0.40:9070   mec51
rook-ceph   csi-rbdplugin-provisioner-77fbb96487-w2zpj   rook-ceph   32m   rook-ceph.rbd.csi.ceph.com   172.10.0.46:9070   mec52
rook-ceph   csi-rbdplugin-qb7g9                          rook-ceph   32m   rook-ceph.rbd.csi.ceph.com   172.10.0.69:9070   mec51
rook-ceph   csi-rbdplugin-tsphv                          rook-ceph   32m   rook-ceph.rbd.csi.ceph.com   172.10.0.50:9070   mec53

The CSI-Addons controller uses the Endpoint recorded in each CSIAddonsNode object to communicate with the sidecar inside the CSI driver Pods.

Create a NetworkFence CR to automatically add node mec52's IP to the OSD blocklist:

$ cat <<EOF | kubectl apply -f -
apiVersion: csiaddons.openshift.io/v1alpha1
kind: NetworkFence
metadata:
  name: network-fence-sample
spec:
  driver: rook-ceph.rbd.csi.ceph.com # fixed
  fenceState: Fenced
  cidrs:
    - 192.168.73.52/32 # node mec52's IP
  secret:
    name: rook-csi-rbd-provisioner # fixed
    namespace: rook-ceph-external # fixed
  parameters:
    clusterID: rook-ceph-external # fixed
EOF

$ kubectl get NetworkFence network-fence-sample
NAME                   DRIVER                       CIDRS                  FENCESTATE   AGE   RESULT
network-fence-sample   rook-ceph.rbd.csi.ceph.com   ["192.168.73.52/32"]   Fenced       7s    Succeeded

$ ceph osd blacklist ls
192.168.73.52:0/0 2029-02-06T08:13:22.359191+0000
listed 1 entries

NetworkFence uses .spec.fenceState to control whether the CIDRs are blocklisted. Change it to Unfenced to remove node mec52's IP from the blocklist:

$ kubectl patch NetworkFence network-fence-sample -p '{"spec":{"fenceState": "Unfenced"}}' --type=merge

$ kubectl get NetworkFence network-fence-sample
NAME                   DRIVER                       CIDRS                  FENCESTATE   AGE     RESULT
network-fence-sample   rook-ceph.rbd.csi.ceph.com   ["192.168.73.52/32"]   Unfenced     2d18h   Succeeded

$ ceph osd blacklist ls
listed 0 entries

I recommend pre-creating an Unfenced NetworkFence object for every node IP in the cluster, and simply patching it to Fenced when needed.
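
If you want to automate that patch (for example from a node-failure handler), here is a minimal sketch using client-go's dynamic client. The object name network-fence-mec52 is hypothetical, and NetworkFence is assumed to be cluster-scoped as in the kubernetes-csi-addons v0.7.0 CRD.

package main

import (
    "context"
    "fmt"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/runtime/schema"
    "k8s.io/apimachinery/pkg/types"
    "k8s.io/client-go/dynamic"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
    if err != nil {
        panic(err)
    }
    dyn, err := dynamic.NewForConfig(cfg)
    if err != nil {
        panic(err)
    }

    gvr := schema.GroupVersionResource{
        Group:    "csiaddons.openshift.io",
        Version:  "v1alpha1",
        Resource: "networkfences",
    }

    // Equivalent to:
    // kubectl patch NetworkFence network-fence-mec52 --type=merge \
    //   -p '{"spec":{"fenceState":"Fenced"}}'
    patch := []byte(`{"spec":{"fenceState":"Fenced"}}`)
    // NetworkFence is treated as cluster-scoped here; use
    // dyn.Resource(gvr).Namespace(...) instead if your CRD version is namespaced.
    _, err = dyn.Resource(gvr).Patch(context.TODO(),
        "network-fence-mec52", types.MergePatchType, patch, metav1.PatchOptions{})
    if err != nil {
        panic(err)
    }
    fmt.Println("node mec52 fenced")
}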

How NetworkFence Works

Unsurprisingly, the CSI-Addons controller watches NetworkFence objects:

https://github.com/csi-addons/kubernetes-csi-addons/blob/1a8fa3b68874313d9f086cbb7324fc29e01a9eef/controllers/csiaddons/networkfence_controller.go

It picks a live endpoint from the CSIAddonsNode objects as the target server, wraps the request into a FenceClusterNetwork call, and sends it.

After receiving the request, the CSI-Addons sidecar forwards it to the CSI driver container:

https://github.com/csi-addons/kubernetes-csi-addons/blob/339f863d5cbd9488ac235409bd854b60963028ba/internal/sidecar/service/networkfence.go#L54-L80

This is why the CSI driver must already implement the CSI-Addons specification. Here, the FenceControllerServer in Ceph RBD CSI receives the FenceClusterNetwork request:

https://github.com/ceph/ceph-csi/blob/29782bf377907e5d1e9413a0f148e1cc7f77693b/internal/csi-addons/rbd/network_fence.go#L58-L86

func (fcs *FenceControllerServer) FenceClusterNetwork(
    ctx context.Context,
    req *fence.FenceClusterNetworkRequest) (*fence.FenceClusterNetworkResponse, error) {
    // a lot of code here
    nwFence, err := nf.NewNetworkFence(ctx, cr, req.Cidrs, req.GetParameters())
    if err != nil {
        return nil, status.Error(codes.Internal, err.Error())
    }

    err = nwFence.AddNetworkFence(ctx)
    if err != nil {
        return nil, status.Errorf(codes.Internal, "failed to fence CIDR block %q: %s", nwFence.Cidr, err.Error())
    }

    return &fence.FenceClusterNetworkResponse{}, nil
}

func (nf *NetworkFence) AddNetworkFence(ctx context.Context) error {
    // for each CIDR block, convert it into a range of IPs so as to perform blocklisting operation.
    for _, cidr := range nf.Cidr {
        // fetch the list of IPs from a CIDR block
        hosts, err := getIPRange(cidr)
        if err != nil {
            return fmt.Errorf("failed to convert CIDR block %s to corresponding IP range: %w", cidr, err)
        }

        // add ceph blocklist for each IP in the range mentioned by the CIDR
        for _, host := range hosts {
            err = nf.addCephBlocklist(ctx, host)
            if err != nil {
                return err
            }
        }
    }

    return nil
}

Finally we arrive at the addCephBlocklist method, which does exactly what we did manually above with ceph osd blacklist add:

func (nf *NetworkFence) addCephBlocklist(ctx context.Context, ip string) error {
    arg := []string{
        "--id", nf.cr.ID,
        "--keyfile=" + nf.cr.KeyFile,
        "-m", nf.Monitors,
    }
    // TODO: add blocklist till infinity.
    // Currently, ceph does not provide the functionality to blocklist IPs
    // for infinite time. As a workaround, add a blocklist for 5 YEARS to
    // represent infinity from ceph-csi side.
    // At any point in this time, the IPs can be unblocked by an UnfenceClusterReq.
    // This needs to be updated once ceph provides functionality for the same.
    cmd := []string{"osd", "blocklist", "add", ip, blocklistTime}
    cmd = append(cmd, arg...)
    _, _, err := util.ExecCommand(ctx, "ceph", cmd...)
    if err != nil {
        return fmt.Errorf("failed to blocklist IP %q: %w", ip, err)
    }
    log.DebugLog(ctx, "blocklisted IP %q successfully", ip)

    return nil
}

So the Ceph RBD CSI driver simply shells out to the ceph CLI (ceph osd blocklist add) and blocklists the IPs covered by the NetworkFence object's .spec.cidrs one by one. Personally I don't think this is ideal; it would be cleaner to send the blocklist request to Ceph through its API (for example librados) instead of executing the CLI.
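
As a rough illustration of that alternative (not something ceph-csi does today), the same blocklist entry could be added over a RADOS connection with the go-ceph bindings. The mon command JSON below is an assumption and should be verified against the target Ceph release; older releases use "osd blacklist"/"blacklistop" instead.

package main

import (
    "encoding/json"
    "log"

    "github.com/ceph/go-ceph/rados"
)

// blocklistAdd adds addr to the Ceph OSD blocklist for expire seconds,
// a librados equivalent of `ceph osd blocklist add <addr> <expire>`.
func blocklistAdd(addr string, expire float64) error {
    conn, err := rados.NewConn()
    if err != nil {
        return err
    }
    if err := conn.ReadDefaultConfigFile(); err != nil { // /etc/ceph/ceph.conf
        return err
    }
    if err := conn.Connect(); err != nil {
        return err
    }
    defer conn.Shutdown()

    // Assumed mon command schema (Pacific and later); pre-Pacific clusters use
    // {"prefix": "osd blacklist", "blacklistop": "add", ...} instead.
    cmd, err := json.Marshal(map[string]interface{}{
        "prefix":      "osd blocklist",
        "blocklistop": "add",
        "addr":        addr,
        "expire":      expire,
    })
    if err != nil {
        return err
    }
    _, info, err := conn.MonCommand(cmd)
    if err != nil {
        return err
    }
    log.Printf("mon reply: %s", info)
    return nil
}

func main() {
    if err := blocklistAdd("192.168.73.52:0/0", 3600); err != nil {
        log.Fatal(err)
    }
}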

Summary

When a node goes bad (for example kubelet dies), a Pod using a Ceph RBD PV with the ReadWriteOnce access mode will not start after being "migrated" to a new node (this is guaranteed to happen in that situation). In that case, consider using the CSI-Addons NetworkFence API to blocklist the failed node's IP so the workload Pod can start first, and then investigate or repair the original node.