The Container Overlay Filesystem
Apr 23, 2022 16:00
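As anyone who made it through third grade knows, a container's rootfs (the container's root directory, the filesystem that gives the container's processes an isolated execution environment) is union-mounted from layered image files.
OverlayFS is one implementation of a Linux union filesystem. It was merged into the mainline Linux kernel in version 3.18 in 2014 and is now widely used by container runtimes.
Mainstream layered storage options on Linux:
- OverlayFS (Docker's overlay and overlay2 storage drivers)
- AUFS
- Btrfs (sometimes pronounced "Better FS"; strictly a copy-on-write filesystem with snapshots rather than a union filesystem)
Let's run an experiment and do a union mount by hand: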
$ mkdir -p /root/test/lower /root/test/upper /root/test/work /root/test/merged
$ mount -t overlay overlay -o lowerdir=/root/test/lower,upperdir=/root/test/upper,workdir=/root/test/work /root/test/merged
$ mount -l | grep /root/test/merged
overlay on /root/test/merged type overlay (rw,relatime,seclabel,lowerdir=/root/test/lower,upperdir=/root/test/upper,workdir=/root/test/work)
Note the three different paths in the mount options: lowerdir, upperdir, and workdir.
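To make the three roles easier to see, here is a minimal sketch that pulls each path out of the option string printed by `mount -l` above. The helper name `overlay_opt` is made up for this post and is not part of `mount(8)`. It assumes paths contain no commas.

```shell
#!/bin/sh
# Print the value of one key (e.g. lowerdir) from a comma-separated
# overlay mount option string. Hypothetical helper for illustration.
overlay_opt() {
    key=$1
    opts=$2
    printf '%s\n' "$opts" | tr ',' '\n' | sed -n "s/^${key}=//p"
}

# Option string copied from the mount output above.
opts='rw,relatime,seclabel,lowerdir=/root/test/lower,upperdir=/root/test/upper,workdir=/root/test/work'

# lowerdir: read-only layer(s); upperdir: writable layer;
# workdir: scratch directory used internally by OverlayFS.
for k in lowerdir upperdir workdir; do
    printf '%s -> %s\n' "$k" "$(overlay_opt "$k" "$opts")"
done
```

The kernel documentation requires workdir to be an empty directory on the same filesystem as upperdir, which is why the experiment creates both under /root/test.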
Write a file into the union mount point /root/test/merged:
$ echo "hello, world!" > /root/test/merged/hello.txt
$ tree /root/test
/root/test
├── lower
├── merged
│ └── hello.txt
├── upper
│ └── hello.txt
└── work
  └── work
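hello.txt also shows up in /root/test/upper, because the path given by upperdir, /root/test/upper, is the overlay filesystem's read-write layer.
Now write a file directly into /root/test/lower (fine for an experiment, but note that the kernel documentation treats changing the underlying directories of a mounted overlay as undefined behavior):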
$ echo "try" > /root/test/lower/try.txt
$ tree /root/test
/root/test
├── lower
│ └── try.txt
├── merged
│ ├── hello.txt
│ └── try.txt
├── upper
│ └── hello.txt
└── work
  └── work
try.txt also shows up in /root/test/merged. Let's try modifying it:
$ echo "ohhhhh" > /root/test/merged/try.txt
$ cat /root/test/merged/try.txt
ohhhhh
$ cat /root/test/lower/try.txt
try
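Although try.txt under /root/test/merged has been modified, try.txt under /root/test/lower is unchanged. The path given by lowerdir is read-only as far as the overlay is concerned: on the first write the file is copied up into upperdir, and the change is applied to that copy.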
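The copy-up behavior can be imitated in user space with two plain directories. This is only a sketch of the idea, not how the kernel implements it: no overlay mount is involved, and `emulate_write` is a hypothetical helper invented for this example.

```shell
#!/bin/sh
# User-space emulation of overlay copy-up using ordinary directories.
emulate_write() {
    lower=$1
    upper=$2
    rel=$3
    data=$4
    mkdir -p "$(dirname "$upper/$rel")"
    # Copy-up: the first write to a file that exists only in the lower
    # layer copies it into the upper layer first.
    if [ ! -e "$upper/$rel" ] && [ -e "$lower/$rel" ]; then
        cp -p "$lower/$rel" "$upper/$rel"
    fi
    # The write itself always lands in the upper layer.
    printf '%s\n' "$data" > "$upper/$rel"
}

# Demo in a throwaway directory; no root privileges needed.
demo=$(mktemp -d)
mkdir -p "$demo/lower" "$demo/upper"
echo "try" > "$demo/lower/try.txt"
emulate_write "$demo/lower" "$demo/upper" try.txt "ohhhhh"
cat "$demo/lower/try.txt"   # prints: try    (lower copy untouched)
cat "$demo/upper/try.txt"   # prints: ohhhhh (modified copy lives in upper)
```

This matches what the later `tree` output shows: after the write through merged, a try.txt appears in /root/test/upper.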
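Next, write a file with the same name directly into both lowerdir and upperdir. (Because modifying the underlying directories of a mounted overlay is undefined behavior, the second `cat` below still printing `lower` is most likely a stale cached lookup; on a fresh mount the upperdir copy would shadow the lowerdir one.)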
$ echo "lower" > /root/test/lower/both.txt
$ cat /root/test/merged/both.txt
lower
$ echo "upper" > /root/test/upper/both.txt
$ cat /root/test/merged/both.txt
lower
$ rm /root/test/merged/both.txt
$ tree /root/test
/root/test
├── lower
│ ├── both.txt
│ └── try.txt
├── merged
│ ├── hello.txt
│ └── try.txt
├── upper
│ ├── both.txt
│ ├── hello.txt
│ └── try.txt
└── work
  └── work
$ ll /root/test/upper
total 8.0K
c---------. 1 root root 0, 0 Apr 23 01:34 both.txt
-rw-r--r--. 1 root root 13 Apr 23 01:06 hello.txt
-rw-r--r--. 1 root root 7 Apr 23 01:15 try.txt
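After deleting both.txt under /root/test/merged, the same-named file in upperdir did not disappear; it turned into a character device with device number 0, 0.
This is the whiteout technique used by OverlayFS: because lowerdir cannot be modified, deleting a file or directory has to be recorded in upperdir instead. When upperdir contains a whiteout with the same name as an entry in lowerdir, that entry is hidden from the union mount point (as is the whiteout itself).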
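As a rough sketch, whiteouts in an upperdir can be spotted by looking for character devices whose major and minor numbers are both 0. The helper names below are hypothetical, and the `stat -c` format assumes GNU coreutils.

```shell
#!/bin/sh
# Succeeds if the path is an overlay whiteout: a character device
# whose device number is 0/0. Assumes GNU stat; %t and %T print the
# major and minor device numbers in hex.
is_whiteout() {
    [ -c "$1" ] && [ "$(stat -c '%t,%T' "$1")" = "0,0" ]
}

# List all whiteouts below a directory (defaults to the upperdir
# used in the experiment above).
list_whiteouts() {
    find "${1:-/root/test/upper}" -type c | while IFS= read -r f; do
        if is_whiteout "$f"; then
            printf '%s\n' "$f"
        fi
    done
}
```

Run against the experiment above, `list_whiteouts /root/test/upper` should print the path of both.txt, while an ordinary device node such as /dev/null is not reported because its device number is not 0/0.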
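Multiple lowerdirs can be union-mounted at once, separated by colons, with the leftmost directory being the top-most layer:
$ mount -t overlay overlay -o lowerdir=/path/to/lower1:/path/to/lower2:/path/to/lower3,upperdir=... /merged
This matters because container images usually consist of many layers.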
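Since image layers are usually listed from the base upward while OverlayFS wants the top-most layer first, the list has to be reversed when building the option. A minimal sketch, with a hypothetical helper name and made-up layer paths:

```shell
#!/bin/sh
# Build a lowerdir= value from layer directories given bottom-first.
# OverlayFS expects the top-most layer first, so each new layer is
# prepended to the result.
build_lowerdir() {
    out=
    for layer in "$@"; do
        out="$layer${out:+:$out}"
    done
    printf '%s\n' "$out"
}

# The paths are placeholders, not real layer directories.
build_lowerdir /path/to/base /path/to/mid /path/to/top
# prints: /path/to/top:/path/to/mid:/path/to/base
```

The result would be passed as `-o lowerdir=...` together with upperdir and workdir, as in the command above.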
Docker
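Docker currently uses Overlay2 as its default storage driver, whereas this article is based on the older Overlay storage driver.
The figure above shows the layered structure of a Docker image and a Docker container: the image layers are the lowerdir; the container layer is the upperdir; and the union mount point, merged, is the container's mount point (rootfs). In Docker, image layers are unpacked under /var/lib/docker/overlay, or under /var/lib/docker/overlay2 when Overlay2 is the storage driver: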
$ tree -L 2 /var/lib/docker/overlay
/var/lib/docker/overlay
├── 0c59e80a1c2b2afa15a25437d389c9ac26ae6e65e55bc496a4bcea3f502194b1
│ └── root
├── 4123d5a0f2b7344f85f6fdd8fb70263fc9e8bad8fdbf27325de0e182d050d8e2
│ └── root
├── 61b624f60ceae019c90e8d3320a4cdceed49ee440100387cb16bebc9c7c06b58
│ └── root
├── 7255aa29ce2271f2d5c41db3185604a52aa17a83868929cd10cdb5b9337420a7
│ └── root
├── cf30980bab8d745b8897fe3140d308fd0723aa623d558349f4ea425652d39cbf
│ └── root
└── fd1a6ba31d6e9b11497ced545031bdcb3f5d9ba6933a0699578b14c68f513347
  └── root
The container layer is also under /var/lib/docker/overlay:
$ ll /var/lib/docker/overlay/ced3ee6e64c9f49401d1d3bf164e43cbea4166d830947b7cc27d28df377095da
total 4.0K
-rw-------. 1 root root 64 Apr 23 02:40 lower-id
drwxr-xr-x. 1 root root 68 Apr 23 02:40 merged
drwxr-xr-x. 6 root root 68 Apr 23 02:40 upper
drwx------. 3 root root 18 Apr 23 02:40 work
$ tree -L 2 /var/lib/docker/overlay/ced3ee6e64c9f49401d1d3bf164e43cbea4166d830947b7cc27d28df377095da
/var/lib/docker/overlay/ced3ee6e64c9f49401d1d3bf164e43cbea4166d830947b7cc27d28df377095da
├── lower-id
├── merged
│ ├── bin
│ ├── boot
│ ├── dev
│ ├── docker-entrypoint.d
│ ├── docker-entrypoint.sh
│ ├── etc
│ ├── home
│ ├── lib
│ ├── lib64
│ ├── media
│ ├── mnt
│ ├── opt
│ ├── proc
│ ├── root
│ ├── run
│ ├── sbin
│ ├── srv
│ ├── sys
│ ├── tmp
│ ├── usr
│ └── var
├── upper
│ ├── dev
│ ├── etc
│ ├── run
│ └── var
└── work
  └── work
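- The lower-id file contains the ID of the image layer the container is mounted directly on top of (its lowerdir):
  $ cat /var/lib/docker/overlay/ced3ee6e64c9f49401d1d3bf164e43cbea4166d830947b7cc27d28df377095da/lower-id
  61b624f60ceae019c90e8d3320a4cdceed49ee440100387cb16bebc9c7c06b58
- The upper subdirectory is the container's read-write layer, that is, the upperdir of the Overlay filesystem.
- The merged subdirectory is the union mount point of lowerdir and upperdir.
- The work subdirectory is used internally by the Overlay filesystem.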
$ mount -l | grep overlay
overlay on /var/lib/docker/overlay/ced3ee6e64c9f49401d1d3bf164e43cbea4166d830947b7cc27d28df377095da/merged type overlay (rw,relatime,seclabel,lowerdir=/var/lib/docker/overlay/61b624f60ceae019c90e8d3320a4cdceed49ee440100387cb16bebc9c7c06b58/root,upperdir=/var/lib/docker/overlay/ced3ee6e64c9f49401d1d3bf164e43cbea4166d830947b7cc27d28df377095da/upper,workdir=/var/lib/docker/overlay/ced3ee6e64c9f49401d1d3bf164e43cbea4166d830947b7cc27d28df377095da/work)
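The relationship between lower-id and the lowerdir in the mount output can be sketched as a small lookup. This assumes the legacy overlay driver layout shown in the listings above; `lowerdir_for` is a hypothetical helper, and the demo runs against a throwaway directory with made-up IDs rather than Docker's real data root.

```shell
#!/bin/sh
# Given the overlay driver's base directory and a container layer ID,
# print the lowerdir that layer is mounted over, by following lower-id.
lowerdir_for() {
    base=$1
    id=$2
    lower_id=$(cat "$base/$id/lower-id") || return 1
    printf '%s/%s/root\n' "$base" "$lower_id"
}

# Self-contained demo that mimics the on-disk layout; "image1" and
# "ctr1" are invented IDs for illustration only.
demo=$(mktemp -d)
mkdir -p "$demo/image1/root" "$demo/ctr1/upper" "$demo/ctr1/work" "$demo/ctr1/merged"
printf '%s' image1 > "$demo/ctr1/lower-id"
lowerdir_for "$demo" ctr1
```

On a real host, calling it with /var/lib/docker/overlay and the container layer ID from the listing should print the same path that appears after `lowerdir=` in the mount output, though reading Docker's data root normally requires root.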
containerd
Although containerd dropped Docker's graph drivers in favor of snapshotters, underneath it is still essentially Overlay:
$ mount -l | grep overlay
overlay on /run/containerd/io.containerd.runtime.v2.task/k8s.io/4c71f21f7e94d43c8c01545bc3fade76f4cb94770f8534b1e3c6a568c88d0cbe/rootfs type overlay (rw,relatime,lowerdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/17/fs,upperdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/604/fs,workdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/604/work)
overlay on /run/containerd/io.containerd.runtime.v2.task/k8s.io/e7f99c00501c9c7bcc387a97437ce020e4bb543f9c8c4afb40394cd8438ce79a/rootfs type overlay (rw,relatime,lowerdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/17/fs,upperdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/605/fs,workdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/605/work)
overlay on /run/containerd/io.containerd.runtime.v2.task/k8s.io/be936f97bc2fc92dee36ae51f5210cee039980fb1b310451cc52e88cdb642cc9/rootfs type overlay (rw,relatime,lowerdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/16/fs:/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/15/fs,upperdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/607/fs,workdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/607/work)
overlay on /run/containerd/io.containerd.runtime.v2.task/k8s.io/117d504a637f71222c6ab9214909ec83512401deeeec2633bf10c6fbe131f51d/rootfs type overlay (rw,relatime,lowerdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/441/fs:/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/440/fs:/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/439/fs,upperdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/606/fs,workdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/606/work)
The mount output shows how the rootfs of a container running under containerd is union-mounted:
- containerd stores image content under /var/lib/containerd/io.containerd.content.v1.content:
  $ tree -L 2 /var/lib/containerd/io.containerd.content.v1.content/blobs/
  /var/lib/containerd/io.containerd.content.v1.content/blobs/
  └── sha256
      ├── 019d8da33d911d9baabe58ad63dea2107ed15115cca0fc27fc0f627e82a695c1
      ├── 052816d6a6844d1e04c19c4dd1f1b55b51fba98732d8ec4c8b92251d1739c704
      ├── 05c1a3be66823dcaca55ebe17c3c9a60de7ceb948047da3e95308348325ddd5a
      ├── 0c6b9ab3ebf9850e30ec8741d87cf101d97eebd3a934d0055850f119237ca1f2
      ├── 0dfc4f1512064e909fa8474ac08c49a5699546b03a7c3e87166d7b77eed640b0
      ├── 0f23e58bd0b7c74311703e20c21c690a6847e62240ed456f8821f4c067d3659b
      ├── 13bf18cc869803a1aedf81330d8ba4c3c3c10c175ba45a0dc866f347b31c7004
      ├── 1ff6c18fbef2045af6b9c16bf034cc421a29027b800e4f9b68ae9b1cb3e9ae07
- With the Overlay-based snapshotter, a snapshot is created and committed for each layer of the image and stored under /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs:
  $ tree -L 2 /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs
  /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs
  ├── metadata.db
  └── snapshots
      ├── 1
      ├── 10
      ├── 11
      ├── 12
      ├── 13
      ├── 14
      ├── 15
      ├── 16
      ├── 17
      ├── 2
      ├── 3
      ├── 32
      ├── 33
      ├── 35
      ├── 36
      ├── 37
      ├── 38
      ├── 39
      ├── 4
  These snapshots are what the lowerdir parameter of the Overlay union mount points to.
- An active snapshot has to be created on top of the image's last layer; that snapshot becomes the container's rootfs:
  $ ctr -n k8s.io snapshot ls
  KEY
  02072e6ad3505f704ce1842634241dbb4275ac0f7f1459096658029e239177ce sha256:f07b5946e28c791718f26d42fa69f2e2b89df33b82ba073627819acdd08e1e9f Active
  0b6955314c4bf50246abf9c100901de87b6b4b20010e86e7d349b2ccb098f9f8 sha256:dee215ffc666313e1381d3e6e4299a4455503735b8df31c3fa161d2df50860a8 Active
  0d158b7fdd0066ac5a3eccc03463b8e19ad88b6d79314969fd5e187450cfb5b4 sha256:dee215ffc666313e1381d3e6e4299a4455503735b8df31c3fa161d2df50860a8 Active
  117d504a637f71222c6ab9214909ec83512401deeeec2633bf10c6fbe131f51d sha256:19606512dfe192788a55d7c1efb9ec02041b4e318587632f755c5112f927e0e3 Active
  306ae926d5fab8f826fe1c01331b5ca40848bcedb5df6519bd5ccb962ff57281 sha256:dee215ffc666313e1381d3e6e4299a4455503735b8df31c3fa161d2df50860a8 Active
  33710f2eafa30783857f66aa00c73f411a7878968cb5468bd74f167790ecf558 sha256:dee215ffc666313e1381d3e6e4299a4455503735b8df31c3fa161d2df50860a8 Active
  $ mount -l | grep 33710f2eafa30783857f66aa00c73f411a7878968cb5468bd74f167790ecf558
  shm on /run/containerd/io.containerd.grpc.v1.cri/sandboxes/33710f2eafa30783857f66aa00c73f411a7878968cb5468bd74f167790ecf558/shm type tmpfs (rw,nosuid,nodev,noexec,relatime,size=65536k)
  overlay on /run/containerd/io.containerd.runtime.v2.task/k8s.io/33710f2eafa30783857f66aa00c73f411a7878968cb5468bd74f167790ecf558/rootfs type overlay (rw,relatime,lowerdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/17/fs,upperdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/598/fs,workdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/598/work)
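To see which snapshots make up a container's read-only layers, the snapshot numbers can be pulled out of the lowerdir option of a mount line. The sketch below relies only on the path layout visible in the mount output in this post; `lower_snapshots` is a hypothetical helper, and the sample line is one of the overlay mounts listed earlier, rebuilt from two variables to keep it readable.

```shell
#!/bin/sh
# Print the snapshot IDs referenced by lowerdir= in one overlay mount
# line, top-most layer first.
lower_snapshots() {
    printf '%s\n' "$1" |
        sed -n 's/.*[(,]lowerdir=\([^,)]*\).*/\1/p' |
        tr ':' '\n' |
        sed -n 's#.*/snapshots/\([0-9][0-9]*\)/fs$#\1#p'
}

snap=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots
task=/run/containerd/io.containerd.runtime.v2.task/k8s.io/be936f97bc2fc92dee36ae51f5210cee039980fb1b310451cc52e88cdb642cc9
line="overlay on $task/rootfs type overlay (rw,relatime,lowerdir=$snap/16/fs:$snap/15/fs,upperdir=$snap/607/fs,workdir=$snap/607/work)"

lower_snapshots "$line"
# prints:
# 16
# 15
```

On a live node, the same function could be fed each line of `mount -t overlay` in a `while read` loop. The upperdir (607 here) is the active snapshot that holds the container's writes.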
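What is a graph driver? https://blog.crazytaxii.com/posts/where_are_containerds_graph_drivers/
Once the union mount is in place, the container runtime uses pivot_root or chroot to switch the root directory of the container process, which is why we see a complete rootfs after attaching to the container with bash.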