Kubernetes Block Mode PV
Oct 6, 2022 15:00 · 3142 words · 7 minute read
Kubernetes supports two modes for a PersistentVolume via the `volumeMode` field: `Filesystem` (the default) and `Block`:
- Filesystem

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: myclaim
spec:
  accessModes:
    - ReadWriteOnce
  volumeMode: Filesystem
  resources:
    requests:
      storage: 8Gi
  storageClassName: fast
```
- Block

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: myclaim
spec:
  accessModes:
    - ReadWriteOnce
  volumeMode: Block
  resources:
    requests:
      storage: 8Gi
  storageClassName: fast
```
A `Filesystem` volume is mounted into the Pod as a directory. If the volume is backed by an empty block device, Kubernetes creates a filesystem on the device before mounting it for the first time. If `volumeMode` is set to `Block`, the volume is used as a raw block device: it is presented to the Pod as a block device without any filesystem on it. This article focuses solely on how a `Block` mode PersistentVolume (PV for short) comes to appear as a block device inside a container: the implementation principles and the low-level technical details.
Several components are involved:
- Backend storage: OpenEBS v0.9.0
- Kubernetes: v1.23.6
- CRI container runtime: containerd v1.6.2
- OCI container runtime: runc v1.1.0
For how a Pod consumes a block mode PV, see the official documentation: https://kubernetes.io/docs/concepts/storage/persistent-volumes/#pod-specification-adding-raw-block-device-path-in-container.
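For convenience, here is a minimal Pod sketch that consumes the claim above as a raw block device; the image and the `devicePath` are illustrative, not taken from the cluster discussed below:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-with-block-volume
spec:
  containers:
    - name: app
      image: busybox            # illustrative image
      command: ["sleep", "infinity"]
      volumeDevices:            # volumeDevices (not volumeMounts) for Block mode
        - name: data
          devicePath: /dev/xvda # where the raw device appears inside the container
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: myclaim
```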
In the text below, the CSI_DIR variable stands for the path /var/lib/kubelet/plugins/kubernetes.io/csi/volumeDevices.
```
$ kubectl exec -it virt-launcher-ecs-test13-wzvx6 -n ns-demo -- ls -al /dev/heredatadisk
brw------- 1 qemu qemu 253, 4 Sep 23 07:52 /dev/heredatadisk
```
The PV used by the virt-launcher-ecs-test13-wzvx6 Pod appears inside the container as the block device file /dev/heredatadisk. Its major number, 253, locates the device driver; its minor number, 4, is passed to the driver as a parameter to select the corresponding unit.
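The kernel packs the major and minor numbers into a single device number (dev_t). As a rough sketch, the following re-implements the Linux encoding that `unix.Mkdev`/`unix.Major`/`unix.Minor` in golang.org/x/sys/unix wrap, showing how 253 and 4 combine and split:

```go
package main

import "fmt"

// mkdev packs major/minor into a Linux dev_t, mirroring the glibc
// makedev(3) macro (low 12 bits of major at bits 8-19, the rest at
// bits 32+; low 8 bits of minor at bits 0-7, the rest at bits 20+).
func mkdev(major, minor uint32) uint64 {
	dev := uint64(major&0x00000fff) << 8
	dev |= uint64(major&0xfffff000) << 32
	dev |= uint64(minor & 0x000000ff)
	dev |= uint64(minor&0xffffff00) << 12
	return dev
}

// major extracts the major number back out of a dev_t.
func major(dev uint64) uint32 {
	return uint32((dev&0x00000000000fff00)>>8) | uint32((dev&0xfffff00000000000)>>32)
}

// minor extracts the minor number back out of a dev_t.
func minor(dev uint64) uint32 {
	return uint32(dev&0x00000000000000ff) | uint32((dev&0x00000ffffff00000)>>12)
}

func main() {
	dev := mkdev(253, 4)
	fmt.Println(major(dev), minor(dev)) // 253 4
}
```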
Let's trace the whole process along the flow Kubernetes follows to create the Pod on the node. Once the Pod has been scheduled:
CSI
- The OpenEBS CSI node plugin, the openebs-lvm-node DaemonSet Pod, creates the LVM logical volume /dev/lvmvg/pvc-aa9d7716-d027-438a-8d76-c95b46a31ce1 on the host:

```
$ lsblk /dev/sdd
NAME                                                  MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sdd                                                     8:48  0  60G  0 disk
└─lvmvg-pvc--aa9d7716--d027--438a--8d76--c95b46a31ce1 253:3   0  10G  0 lvm
$ ll /dev/lvmvg/pvc-aa9d7716-d027-438a-8d76-c95b46a31ce1
lrwxrwxrwx 1 root root 7 Sep 26 09:57 /dev/lvmvg/pvc-aa9d7716-d027-438a-8d76-c95b46a31ce1 -> ../dm-3
$ ll /dev/dm-3
brw-rw---- 1 root disk 253, 4 Oct 6 13:03 /dev/dm-3
```
The /dev/dm-3 block device file has major number 253 and minor number 4. Its parent path, /dev, is mounted on the host as devtmpfs, a special "filesystem":

```
$ mount -l | grep "/dev "
devtmpfs on /dev type devtmpfs (rw,nosuid,size=16327076k,nr_inodes=4081769,mode=755)
```
In fact, the /dev/dm-3 block device file is itself special: its inode is not associated with data on a storage medium; it only establishes a connection to the driver.
- The CSI node plugin creates the target file $CSI_DIR/publish/pvc-aa9d7716-d027-438a-8d76-c95b46a31ce1/47ad9107-5e1a-4a27-b8d1-1e21b6ed0f12 (https://github.com/openebs/lvm-localpv/blob/v0.9.0/pkg/lvm/mount.go#L241):

```go
// MountBlock mounts the block disk to the specified path
func MountBlock(vol *apis.LVMVolume, mountinfo *MountInfo, podLVInfo *PodLVInfo) error {
	target := mountinfo.MountPath
	volume := vol.Spec.VolGroup + "/" + vol.Name
	devicePath := DevPath + volume
	mountopt := []string{"bind"}
	mounter := &mount.SafeFormatAndMount{Interface: mount.New(""), Exec: utilexec.New()}
	// Create the mount point as a file since bind mount device node requires it to be a file
	err := makeFile(target)
	if err != nil {
		return status.Errorf(codes.Internal, "Could not create target file %q: %v", target, err)
	}
	// a lot of code here
}

// https://github.com/openebs/lvm-localpv/blob/v0.9.0/pkg/lvm/mount.go#L301-L313
func makeFile(pathname string) error {
	f, err := os.OpenFile(pathname, os.O_CREATE, os.FileMode(0644))
	defer func(f *os.File) {
		err = f.Close()
		klog.Errorf("failed to close file %s error: %v", f.Name(), err)
	}(f)
	if err != nil {
		if !os.IsExist(err) {
			return err
		}
	}
	return nil
}
```
This file follows the CSI specification and is created for the Kubernetes Pod's use.
- The CSI node plugin bind mounts /dev/lvmvg/pvc-aa9d7716-d027-438a-8d76-c95b46a31ce1 onto the newly created file $CSI_DIR/publish/pvc-aa9d7716-d027-438a-8d76-c95b46a31ce1/47ad9107-5e1a-4a27-b8d1-1e21b6ed0f12 (https://github.com/openebs/lvm-localpv/blob/v0.9.0/pkg/lvm/mount.go#L247):

```go
func MountBlock(vol *apis.LVMVolume, mountinfo *MountInfo, podLVInfo *PodLVInfo) error {
	target := mountinfo.MountPath
	volume := vol.Spec.VolGroup + "/" + vol.Name
	devicePath := DevPath + volume
	mountopt := []string{"bind"}
	// a lot of code here
	// do the bind mount of the device at the target path
	if err := mounter.Mount(devicePath, target, "", mountopt); err != nil {
		if removeErr := os.Remove(target); removeErr != nil {
			return status.Errorf(codes.Internal, "Could not remove mount target %q: %v", target, removeErr)
		}
		return status.Errorf(codes.Internal, "mount failed at %v err : %v", target, err)
	}
	// a lot of code here
}
```
- After the bind mount, the type of the file $CSI_DIR/publish/pvc-aa9d7716-d027-438a-8d76-c95b46a31ce1/47ad9107-5e1a-4a27-b8d1-1e21b6ed0f12 becomes block, and its "filesystem" becomes devtmpfs:

```
$ ll /var/lib/kubelet/plugins/kubernetes.io/csi/volumeDevices/publish/pvc-aa9d7716-d027-438a-8d76-c95b46a31ce1/47ad9107-5e1a-4a27-b8d1-1e21b6ed0f12
brw-rw---- 1 root disk 253, 3 Sep 26 09:57 /var/lib/kubelet/plugins/kubernetes.io/csi/volumeDevices/publish/pvc-aa9d7716-d027-438a-8d76-c95b46a31ce1/47ad9107-5e1a-4a27-b8d1-1e21b6ed0f12
$ mount -l | grep /var/lib/kubelet/plugins/kubernetes.io/csi/volumeDevices/publish/pvc-aa9d7716-d027-438a-8d76-c95b46a31ce1/47ad9107-5e1a-4a27-b8d1-1e21b6ed0f12
devtmpfs on /var/lib/kubelet/plugins/kubernetes.io/csi/volumeDevices/publish/pvc-aa9d7716-d027-438a-8d76-c95b46a31ce1/47ad9107-5e1a-4a27-b8d1-1e21b6ed0f12 type devtmpfs (rw,nosuid,size=16327076k,nr_inodes=4081769,mode=755)
```
Note that $CSI_DIR/publish/pvc-aa9d7716-d027-438a-8d76-c95b46a31ce1/47ad9107-5e1a-4a27-b8d1-1e21b6ed0f12 has the same major and minor numbers as the LVM logical volume /dev/lvmvg/pvc-aa9d7716-d027-438a-8d76-c95b46a31ce1: they are backed by the same device.
Take another look at the mount table inside the virt-launcher Pod:
```
$ kubectl exec -it virt-launcher-ecs-test13-qqpt2 -n ns-demo -- mount -l
overlay on / type overlay (rw,relatime,lowerdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/10648/fs:/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/10647/fs:/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/10646/fs,upperdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/11651/fs,workdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/11651/work)
proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
tmpfs on /dev type tmpfs (rw,nosuid,size=65536k,mode=755)
```
Inside the container, /dev is mounted as tmpfs, whereas on the host it is devtmpfs; we will come back to this later.
CRI
1. kubelet
When Kubernetes creates the container (Pod), it packs the mapping between the block device file inside the container, /dev/heredatadisk (declared in the Pod object), and the block device file on the host, $CSI_DIR/publish/pvc-aa9d7716-d027-438a-8d76-c95b46a31ce1/47ad9107-5e1a-4a27-b8d1-1e21b6ed0f12, into the CRI container creation request (https://github.com/kubernetes/kubernetes/blob/v1.23.6/pkg/kubelet/kubelet_pods.go#L110-L141):
```go
func (kl *Kubelet) makeBlockVolumes(pod *v1.Pod, container *v1.Container, podVolumes kubecontainer.VolumeMap, blkutil volumepathhandler.BlockVolumePathHandler) ([]kubecontainer.DeviceInfo, error) {
	var devices []kubecontainer.DeviceInfo
	for _, device := range container.VolumeDevices {
		// check path is absolute
		if !filepath.IsAbs(device.DevicePath) {
			return nil, fmt.Errorf("error DevicePath `%s` must be an absolute path", device.DevicePath)
		}
		vol, ok := podVolumes[device.Name]
		if !ok || vol.BlockVolumeMapper == nil {
			klog.ErrorS(nil, "Block volume cannot be satisfied for container, because the volume is missing or the volume mapper is nil", "containerName", container.Name, "device", device)
			return nil, fmt.Errorf("cannot find volume %q to pass into container %q", device.Name, container.Name)
		}
		// Get a symbolic link associated to a block device under pod device path
		dirPath, volName := vol.BlockVolumeMapper.GetPodDeviceMapPath()
		symlinkPath := path.Join(dirPath, volName)
		if islinkExist, checkErr := blkutil.IsSymlinkExist(symlinkPath); checkErr != nil {
			return nil, checkErr
		} else if islinkExist {
			// Check readOnly in PVCVolumeSource and set read only permission if it's true.
			permission := "mrw"
			if vol.ReadOnly {
				permission = "r"
			}
			klog.V(4).InfoS("Device will be attached to container in the corresponding path on host", "containerName", container.Name, "path", symlinkPath)
			devices = append(devices, kubecontainer.DeviceInfo{PathOnHost: symlinkPath, PathInContainer: device.DevicePath, Permissions: permission})
		}
	}
	return devices, nil
}
```
Tracing this process backwards through kubelet's call chain: makeBlockVolumes <- Kubelet.GenerateRunContainerOptions <- kubeGenericRuntimeManager.generateContainerConfig <- kubeGenericRuntimeManager.startContainer
```go
func (m *kubeGenericRuntimeManager) startContainer(podSandboxID string, podSandboxConfig *runtimeapi.PodSandboxConfig, spec *startSpec, pod *v1.Pod, podStatus *kubecontainer.PodStatus, pullSecrets []v1.Secret, podIP string, podIPs []string) (string, error) {
	// a lot of code here
	containerConfig, cleanupAction, err := m.generateContainerConfig(container, pod, restartCount, podIP, imageRef, podIPs, target)
	if cleanupAction != nil {
		defer cleanupAction()
	}
	containerID, err := m.runtimeService.CreateContainer(podSandboxID, containerConfig, podSandboxConfig)
	if err != nil {
		s, _ := grpcstatus.FromError(err)
		m.recordContainerEvent(pod, container, containerID, v1.EventTypeWarning, events.FailedToCreateContainer, "Error: %v", s.Message())
		return s.Message(), ErrCreateContainer
	}
	// a lot of code here
}
```
kubelet then sends the container creation request over gRPC to the container runtime, containerd.
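The shape of that handoff can be sketched with simplified stand-ins for kubecontainer.DeviceInfo and the CRI Device message. The field names below mirror the real types in k8s.io/kubernetes and k8s.io/cri-api, but this is an illustration of the mapping, not the actual kubelet code:

```go
package main

import "fmt"

// DeviceInfo is a simplified stand-in for kubecontainer.DeviceInfo,
// the struct makeBlockVolumes returns.
type DeviceInfo struct {
	PathOnHost      string
	PathInContainer string
	Permissions     string
}

// CRIDevice is a simplified stand-in for the Device message in the
// CRI ContainerConfig sent over gRPC.
type CRIDevice struct {
	ContainerPath string
	HostPath      string
	Permissions   string
}

// makeDevices mirrors the shape of kubelet's conversion when it builds
// the ContainerConfig for the CreateContainer call.
func makeDevices(devs []DeviceInfo) []CRIDevice {
	out := make([]CRIDevice, 0, len(devs))
	for _, d := range devs {
		out = append(out, CRIDevice{
			ContainerPath: d.PathInContainer,
			HostPath:      d.PathOnHost,
			Permissions:   d.Permissions,
		})
	}
	return out
}

func main() {
	devs := makeDevices([]DeviceInfo{{
		PathOnHost:      "/var/lib/kubelet/plugins/kubernetes.io/csi/volumeDevices/publish/pvc-aa9d7716-d027-438a-8d76-c95b46a31ce1/47ad9107-5e1a-4a27-b8d1-1e21b6ed0f12",
		PathInContainer: "/dev/heredatadisk",
		Permissions:     "mrw",
	}})
	fmt.Println(devs[0].ContainerPath, devs[0].Permissions) // /dev/heredatadisk mrw
}
```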
2. containerd
Upon receiving kubelet's container creation request, containerd first generates the default mounts (https://github.com/containerd/containerd/blob/v1.6.2/oci/spec.go#L132-L197):
```go
func populateDefaultUnixSpec(ctx context.Context, s *Spec, id string) error {
	ns, err := namespaces.NamespaceRequired(ctx)
	if err != nil {
		return err
	}
	*s = Spec{
		// a lot of code here
	}
	s.Mounts = defaultMounts()
	return nil
}

func defaultMounts() []specs.Mount {
	return []specs.Mount{
		{
			Destination: "/proc",
			Type:        "proc",
			Source:      "proc",
			Options:     []string{"nosuid", "noexec", "nodev"},
		},
		{
			Destination: "/dev",
			Type:        "tmpfs",
			Source:      "tmpfs",
			Options:     []string{"nosuid", "strictatime", "mode=755", "size=65536k"},
		},
		// a lot of code here
	}
}
```
Mounting /dev inside the container as tmpfs is the convention followed by runc, i.e. libcontainer.
And /dev/heredatadisk is packed into the OCI container struct as the ContainerPath (https://github.com/containerd/containerd/blob/v1.6.2/oci/spec_opts_linux.go#L285-L328):
```go
func WithDevices(osi osinterface.OS, config *runtime.ContainerConfig, enableDeviceOwnershipFromSecurityContext bool) oci.SpecOpts {
	return func(ctx context.Context, client oci.Client, c *containers.Container, s *runtimespec.Spec) (err error) {
		// a lot of code here
		for _, device := range config.GetDevices() {
			path, err := osi.ResolveSymbolicLink(device.HostPath)
			if err != nil {
				return err
			}
			o := oci.WithDevices(path, device.ContainerPath, device.Permissions)
			if err := o(ctx, client, c, s); err != nil {
				return err
			}
		}
		// a lot of code here
	}
}

// https://github.com/containerd/containerd/blob/v1.6.2/oci/spec_opts_linux.go#L39-L61
func WithDevices(devicePath, containerPath, permissions string) SpecOpts {
	return func(_ context.Context, _ Client, _ *containers.Container, s *Spec) error {
		devs, err := getDevices(devicePath, containerPath)
		if err != nil {
			return err
		}
		for i := range devs {
			s.Linux.Devices = append(s.Linux.Devices, devs[i])
			s.Linux.Resources.Devices = append(s.Linux.Resources.Devices, specs.LinuxDeviceCgroup{
				Allow:  true,
				Type:   devs[i].Type,
				Major:  &devs[i].Major,
				Minor:  &devs[i].Minor,
				Access: permissions,
			})
		}
		return nil
	}
}

func DeviceFromPath(path string) (*specs.LinuxDevice, error) {
	var stat unix.Stat_t
	if err := unix.Lstat(path, &stat); err != nil {
		return nil, err
	}
	var (
		devNumber = uint64(stat.Rdev) //nolint: unconvert // the type is 32bit on mips.
		major     = unix.Major(devNumber)
		minor     = unix.Minor(devNumber)
	)
	// a lot of code here
}

func Lstat(path string, stat *Stat_t) (err error) {
	return Fstatat(AT_FDCWD, path, stat, AT_SYMLINK_NOFOLLOW)
}
```
Via the lstat function, i.e. the fstatat system call, containerd obtains the major and minor device numbers, 253 and 4, for $CSI_DIR/publish/pvc-aa9d7716-d027-438a-8d76-c95b46a31ce1/47ad9107-5e1a-4a27-b8d1-1e21b6ed0f12.
containerd passes the container creation request (i.e. the configuration) to the OCI container runtime, runc, through the config.json file in the directory one level above the container's rootfs:
```
$ ll /run/containerd/io.containerd.runtime.v2.task/k8s.io/06d9182a20cab4de567357f3af1d24201e8093e66fb37c50153f82f3b2a56669/
total 24
-rw-r--r-- 1 root root   89 Sep 26 10:18 address
-rw-r--r-- 1 root root 7242 Sep 26 10:18 config.json
-rw-r--r-- 1 root root    5 Sep 26 10:18 init.pid
prwx------ 1 root root    0 Sep 26 10:18 log
-rw-r--r-- 1 root root    0 Sep 26 10:18 log.json
-rw------- 1 root root    2 Sep 26 10:18 options.json
drwxr-xr-x 1 root root   74 Sep 26 10:18 rootfs
-rw------- 1 root root    0 Sep 26 10:18 runtime
-rw------- 1 root root   38 Sep 26 10:18 shim-binary-path
lrwxrwxrwx 1 root root  121 Sep 26 10:18 work -> /var/lib/containerd/io.containerd.runtime.v2.task/k8s.io/06d9182a20cab4de567357f3af1d24201e8093e66fb37c50153f82f3b2a56669
```
```
$ cat /run/containerd/io.containerd.runtime.v2.task/k8s.io/06d9182a20cab4de567357f3af1d24201e8093e66fb37c50153f82f3b2a56669/config.json
"devices": [
    {
        "path": "/dev/kvm",
        "type": "c",
        "major": 10,
        "minor": 232,
        "fileMode": 438,
        "uid": 0,
        "gid": 36
    },
    {
        "path": "/dev/net/tun",
        "type": "c",
        "major": 10,
        "minor": 200,
        "fileMode": 438,
        "uid": 0,
        "gid": 0
    },
    {
        "path": "/dev/vhost-net",
        "type": "c",
        "major": 10,
        "minor": 238,
        "fileMode": 384,
        "uid": 0,
        "gid": 0
    },
    {
        "path": "/dev/bootdisk",
        "type": "b",
        "major": 252,
        "minor": 16,
        "fileMode": 432,
        "uid": 0,
        "gid": 6
    },
    {
        "path": "/dev/heredatadisk",
        "type": "b",
        "major": 253,
        "minor": 3,
        "fileMode": 432,
        "uid": 0,
        "gid": 6
    }
],
```
Note that the path /dev/heredatadisk and its major/minor numbers (major 253, minor 3) sit in the same data structure.
OCI
Finally, we arrive at the OCI container runtime, runc.
Tracing the call chain when the container is created: prepareRootfs -> createDevices -> createDeviceNode -> mknodDevice
```go
func createDevices(config *configs.Config) error {
	useBindMount := userns.RunningInUserNS() || config.Namespaces.Contains(configs.NEWUSER)
	oldMask := unix.Umask(0o000)
	for _, node := range config.Devices {
		// The /dev/ptmx device is setup by setupPtmx()
		if utils.CleanPath(node.Path) == "/dev/ptmx" {
			continue
		}
		// containers running in a user namespace are not allowed to mknod
		// devices so we can just bind mount it from the host.
		if err := createDeviceNode(config.Rootfs, node, useBindMount); err != nil {
			unix.Umask(oldMask)
			return err
		}
	}
	unix.Umask(oldMask)
	return nil
}
```
```go
func createDeviceNode(rootfs string, node *devices.Device, bind bool) error {
	if node.Path == "" {
		// The node only exists for cgroup reasons, ignore it here.
		return nil
	}
	dest, err := securejoin.SecureJoin(rootfs, node.Path)
	if err != nil {
		return err
	}
	if err := os.MkdirAll(filepath.Dir(dest), 0o755); err != nil {
		return err
	}
	if bind {
		return bindMountDeviceNode(rootfs, dest, node)
	}
	if err := mknodDevice(dest, node); err != nil {
		if errors.Is(err, os.ErrExist) {
			return nil
		} else if errors.Is(err, os.ErrPermission) {
			return bindMountDeviceNode(rootfs, dest, node)
		}
		return err
	}
	return nil
}
```
```go
func mknodDevice(dest string, node *devices.Device) error {
	fileMode := node.FileMode
	switch node.Type {
	case devices.BlockDevice:
		fileMode |= unix.S_IFBLK
	case devices.CharDevice:
		fileMode |= unix.S_IFCHR
	case devices.FifoDevice:
		fileMode |= unix.S_IFIFO
	default:
		return fmt.Errorf("%c is not a valid device type for device %s", node.Type, node.Path)
	}
	dev, err := node.Mkdev()
	if err != nil {
		return err
	}
	if err := unix.Mknod(dest, uint32(fileMode), int(dev)); err != nil {
		return &os.PathError{Op: "mknod", Path: dest, Err: err}
	}
	return os.Chown(dest, int(node.Uid), int(node.Gid))
}

// https://github.com/opencontainers/runc/blob/v1.1.1/libcontainer/devices/device_unix.go#L23-L28
func mkDev(d *Rule) (uint64, error) {
	if d.Major == Wildcard || d.Minor == Wildcard {
		return 0, errors.New("cannot mkdev() device with wildcards")
	}
	return unix.Mkdev(uint32(d.Major), uint32(d.Minor)), nil
}
```
At bottom, the mknod system call creates the block device file heredatadisk under the container's /dev. Through its major/minor numbers (major 253, minor 3), this device file reaches the device driver that operates the actual device: the LVM logical volume /dev/lvmvg/pvc-aa9d7716-d027-438a-8d76-c95b46a31ce1 on the host.
Summary
Container creation is a chain of components, from kubelet to containerd to runc, with the interfaces between them defined by specifications (CSI, CRI, OCI). A block device file inside a container is, in the end, created by the OCI container runtime via mknod, and the major and minor numbers the file is bound to are handed down, layer by layer, through a series of gRPC calls and even plain files.