Container Overlay Filesystem

Apr 23, 2022 16:00 · Linux Container FileSystem

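As anyone who made it through third grade knows, a container's rootfs (the container's root directory, the filesystem that gives the container process its isolated execution environment) is union-mounted from layered image files.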

OverlayFS is one of the Linux union filesystem implementations. It was merged into the Linux kernel mainline in version 3.18 in 2014 and is now widely used by container runtimes.

Mainstream Linux union filesystems:

  • Overlay(Overlay2)
  • AUFS
  • Btrfs (B-tree file system, sometimes read as "Better FS")

Let's run an experiment and do a union mount by hand:

$ mkdir -p /root/test/lower /root/test/upper /root/test/work /root/test/merged
$ mount -t overlay overlay -o lowerdir=/root/test/lower,upperdir=/root/test/upper,workdir=/root/test/work /root/test/merged
$ mount -l | grep /root/test/merged
overlay on /root/test/merged type overlay (rw,relatime,seclabel,lowerdir=/root/test/lower,upperdir=/root/test/upper,workdir=/root/test/work)

Note that the mount options contain three different paths: lowerdir, upperdir and workdir.
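To pull those three paths back out of a line printed by mount, a small helper is enough. This is only a minimal sketch: overlay_opt is a hypothetical name, and it assumes that none of the paths contain commas or parentheses.

```shell
#!/bin/sh
# overlay_opt NAME MOUNT_LINE   (hypothetical helper, not part of any tool)
# Print the value of one overlay mount option (lowerdir, upperdir or workdir)
# from a line in the format printed by `mount -l`.
# Assumption: paths contain no commas or parentheses.
overlay_opt() {
    name=$1
    line=$2
    # Keep only the text between the last "(" and the closing ")".
    opts=${line##*\(}
    opts=${opts%\)*}
    # Options are comma-separated; print the value of the matching key.
    printf '%s\n' "$opts" | tr ',' '\n' | sed -n "s/^${name}=//p"
}
```

For example, overlay_opt upperdir "$(mount -l | grep /root/test/merged)" should print the read-write directory of the mount created above.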

Write a file to the union-mounted path /root/test/merged:

$ echo "hello, world!" > /root/test/merged/hello.txt
$ tree /root/test
/root/test
├── lower
├── merged
│   └── hello.txt
├── upper
│   └── hello.txt
└── work
    └── work

hello.txt also shows up under /root/test/upper, because the path given by upperdir, /root/test/upper, is the read-write layer of the overlay filesystem.

Write a file to /root/test/lower:

$ echo "try" > /root/test/lower/try.txt
$ tree /root/test
/root/test
├── lower
│   └── try.txt
├── merged
│   ├── hello.txt
│   └── try.txt
├── upper
│   └── hello.txt
└── work
    └── work

try.txt also shows up under /root/test/merged. Let's try to modify it:

$ echo "ohhhhh" > /root/test/merged/try.txt
$ cat /root/test/merged/try.txt
ohhhhh
$ cat /root/test/lower/try.txt
try

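Although try.txt under /root/test/merged was modified, try.txt under /root/test/lower is unchanged. That is because the path given by lowerdir is treated as read-only: the modified file is copied up into upperdir and the change is made there.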
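One way to see which layer holds a given file is to look at the two directories directly. The helper below is a minimal sketch under that assumption; layer_status is a hypothetical name, and it only inspects the directories, it does not touch the mount.

```shell
#!/bin/sh
# layer_status LOWER UPPER RELPATH   (hypothetical helper)
# Report in which overlay layer directory RELPATH exists.
layer_status() {
    lower=$1
    upper=$2
    rel=$3
    if [ -e "$upper/$rel" ] && [ -e "$lower/$rel" ]; then
        echo "both"         # the entry in upperdir shadows the one in lowerdir
    elif [ -e "$upper/$rel" ]; then
        echo "upper-only"   # exists only in the read-write layer
    elif [ -e "$lower/$rel" ]; then
        echo "lower-only"   # untouched file from the read-only layer
    else
        echo "absent"
    fi
}
```

With the directories from this experiment, layer_status /root/test/lower /root/test/upper try.txt should report both after the modification, while hello.txt should report upper-only.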

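Write a file with the same name to both lowerdir and upperdir. Note that the kernel documentation describes changing the underlying directories of a mounted overlay as undefined behavior, which may be why the second cat still prints lower: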

$ echo "lower" > /root/test/lower/both.txt
$ cat /root/test/merged/both.txt
lower
$ echo "upper" > /root/test/upper/both.txt
$ cat /root/test/merged/both.txt
lower
$ rm /root/test/merged/both.txt
$ tree /root/test
/root/test
├── lower
│   ├── both.txt
│   └── try.txt
├── merged
│   ├── hello.txt
│   └── try.txt
├── upper
│   ├── both.txt
│   ├── hello.txt
│   └── try.txt
└── work
    └── work
$ ll /root/test/upper
total 8.0K
c---------. 1 root root 0, 0 Apr 23 01:34 both.txt
-rw-r--r--. 1 root root   13 Apr 23 01:06 hello.txt
-rw-r--r--. 1 root root    7 Apr 23 01:15 try.txt

After both.txt is deleted under /root/test/merged, the file of the same name in upperdir does not disappear. Instead it turns into a character device.

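This is a technique OverlayFS uses called a whiteout: when a file or directory is deleted, the deletion has to be recorded in upperdir. When upperdir contains a whiteout with the same name as an entry in lowerdir, that entry is ignored at the union mount point and is not shown (and neither is the whiteout itself).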
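Whiteouts can be spotted from the host by looking for character devices with device number 0,0 in upperdir, as in the ll output above. The following is a minimal sketch; list_whiteouts is a hypothetical name and the stat -c format assumes GNU coreutils.

```shell
#!/bin/sh
# list_whiteouts UPPERDIR   (hypothetical helper)
# Print every character device under UPPERDIR whose major,minor number is 0,0,
# which is how the whiteout appeared in the listing above.
# Assumption: GNU stat (%t and %T print the major and minor numbers in hex).
list_whiteouts() {
    upper=$1
    find "$upper" -type c | while IFS= read -r f; do
        if [ "$(stat -c '%t,%T' "$f")" = "0,0" ]; then
            printf '%s\n' "$f"
        fi
    done
}
```

Run against /root/test/upper after the deletion, it should list both.txt and nothing else.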

Several lowerdirs can also be union-mounted at once: mount -t overlay overlay -o lowerdir=/path/to/lower1:/path/to/lower2:/path/to/lower3,upperdir=... /merged. Container images usually consist of many layers.
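Building that colon-separated lowerdir value by hand gets tedious with many layers. The sketch below only prints the mount command instead of running it; join_lowerdirs and overlay_mount_cmd are hypothetical names. According to the kernel's overlayfs documentation, the leftmost lowerdir is the topmost layer, so the layers are passed top first.

```shell
#!/bin/sh
# join_lowerdirs DIR...   (hypothetical helper)
# Join the given directories with ":" in the order given (topmost layer first).
join_lowerdirs() {
    out=
    for d in "$@"; do
        out="${out:+$out:}$d"
    done
    printf '%s\n' "$out"
}

# overlay_mount_cmd MERGED UPPER WORK LOWER...   (hypothetical helper)
# Print, but do not execute, the mount command for the given layout.
overlay_mount_cmd() {
    merged=$1
    upper=$2
    work=$3
    shift 3
    printf 'mount -t overlay overlay -o lowerdir=%s,upperdir=%s,workdir=%s %s\n' \
        "$(join_lowerdirs "$@")" "$upper" "$work" "$merged"
}
```

Printing the command rather than executing it keeps the sketch usable without root; the output can be reviewed and then run by hand.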

Docker

Docker currently uses Overlay2 as its default storage driver, whereas this article is based on the Overlay storage driver.

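The figure above shows the layered structure of Docker images and Docker containers: the image layers are the lowerdir, the container layer is the upperdir, and the union mount point merged is the container's mount point (rootfs). In Docker, the image layers are all unpacked under /var/lib/docker/overlay, or under /var/lib/docker/overlay2 when Overlay2 is the storage driver: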

$ tree -L 2 /var/lib/docker/overlay
/var/lib/docker/overlay
├── 0c59e80a1c2b2afa15a25437d389c9ac26ae6e65e55bc496a4bcea3f502194b1
│   └── root
├── 4123d5a0f2b7344f85f6fdd8fb70263fc9e8bad8fdbf27325de0e182d050d8e2
│   └── root
├── 61b624f60ceae019c90e8d3320a4cdceed49ee440100387cb16bebc9c7c06b58
│   └── root
├── 7255aa29ce2271f2d5c41db3185604a52aa17a83868929cd10cdb5b9337420a7
│   └── root
├── cf30980bab8d745b8897fe3140d308fd0723aa623d558349f4ea425652d39cbf
│   └── root
└── fd1a6ba31d6e9b11497ced545031bdcb3f5d9ba6933a0699578b14c68f513347
    └── root

The container layer also lives under /var/lib/docker/overlay:

$ ll /var/lib/docker/overlay/ced3ee6e64c9f49401d1d3bf164e43cbea4166d830947b7cc27d28df377095da
total 4.0K
-rw-------. 1 root root 64 Apr 23 02:40 lower-id
drwxr-xr-x. 1 root root 68 Apr 23 02:40 merged
drwxr-xr-x. 6 root root 68 Apr 23 02:40 upper
drwx------. 3 root root 18 Apr 23 02:40 work
$ tree -L 2 /var/lib/docker/overlay/ced3ee6e64c9f49401d1d3bf164e43cbea4166d830947b7cc27d28df377095da
/var/lib/docker/overlay/ced3ee6e64c9f49401d1d3bf164e43cbea4166d830947b7cc27d28df377095da
├── lower-id
├── merged
│   ├── bin
│   ├── boot
│   ├── dev
│   ├── docker-entrypoint.d
│   ├── docker-entrypoint.sh
│   ├── etc
│   ├── home
│   ├── lib
│   ├── lib64
│   ├── media
│   ├── mnt
│   ├── opt
│   ├── proc
│   ├── root
│   ├── run
│   ├── sbin
│   ├── srv
│   ├── sys
│   ├── tmp
│   ├── usr
│   └── var
├── upper
│   ├── dev
│   ├── etc
│   ├── run
│   └── var
└── work
    └── work
  • The lower-id file contains the ID of the image layer that this container is mounted on top of

    $ cat /var/lib/docker/overlay/ced3ee6e64c9f49401d1d3bf164e43cbea4166d830947b7cc27d28df377095da/lower-id
    61b624f60ceae019c90e8d3320a4cdceed49ee440100387cb16bebc9c7c06b58
    
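  • The upper subdirectory is the container's read-write layer, that is, the upperdir of the Overlay filesystem

  • The merged subdirectory is the union mount point of lowerdir and upperdir

  • The work subdirectory is used internally by the Overlay filesystem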

$ mount -l | grep overlay
overlay on /var/lib/docker/overlay/ced3ee6e64c9f49401d1d3bf164e43cbea4166d830947b7cc27d28df377095da/merged type overlay (rw,relatime,seclabel,lowerdir=/var/lib/docker/overlay/61b624f60ceae019c90e8d3320a4cdceed49ee440100387cb16bebc9c7c06b58/root,upperdir=/var/lib/docker/overlay/ced3ee6e64c9f49401d1d3bf164e43cbea4166d830947b7cc27d28df377095da/upper,workdir=/var/lib/docker/overlay/ced3ee6e64c9f49401d1d3bf164e43cbea4166d830947b7cc27d28df377095da/work)
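The lowerdir in that mount line can be derived from lower-id alone. The helper below is a minimal sketch that assumes the directory layout shown above for the Overlay storage driver (a lower-id file in the container layer directory, and a root directory in each image layer directory); docker_overlay_lowerdir is a hypothetical name.

```shell
#!/bin/sh
# docker_overlay_lowerdir LAYER_DIR   (hypothetical helper)
# Given a container layer directory that contains a lower-id file, print the
# path of the image layer directory it points at.
# Assumption: the sibling layout <base>/<id>/root shown in this article.
docker_overlay_lowerdir() {
    layer_dir=${1%/}
    id=$(cat "$layer_dir/lower-id") || return 1
    printf '%s/%s/root\n' "$(dirname "$layer_dir")" "$id"
}
```

Applied to the container layer directory listed above, it should print the same path that appears after lowerdir= in the mount output.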

containerd

Although containerd dropped Docker's graph driver in favor of a snapshot filesystem, what sits underneath is still essentially Overlay:

$ mount -l | grep overlay
overlay on /run/containerd/io.containerd.runtime.v2.task/k8s.io/4c71f21f7e94d43c8c01545bc3fade76f4cb94770f8534b1e3c6a568c88d0cbe/rootfs type overlay (rw,relatime,lowerdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/17/fs,upperdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/604/fs,workdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/604/work)
overlay on /run/containerd/io.containerd.runtime.v2.task/k8s.io/e7f99c00501c9c7bcc387a97437ce020e4bb543f9c8c4afb40394cd8438ce79a/rootfs type overlay (rw,relatime,lowerdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/17/fs,upperdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/605/fs,workdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/605/work)
overlay on /run/containerd/io.containerd.runtime.v2.task/k8s.io/be936f97bc2fc92dee36ae51f5210cee039980fb1b310451cc52e88cdb642cc9/rootfs type overlay (rw,relatime,lowerdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/16/fs:/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/15/fs,upperdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/607/fs,workdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/607/work)
overlay on /run/containerd/io.containerd.runtime.v2.task/k8s.io/117d504a637f71222c6ab9214909ec83512401deeeec2633bf10c6fbe131f51d/rootfs type overlay (rw,relatime,lowerdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/441/fs:/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/440/fs:/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/439/fs,upperdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/606/fs,workdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/606/work)
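Lines this long are hard to read, so it can help to print one layer per line, top to bottom. The following is a minimal sketch; overlay_layers is a hypothetical name, and it assumes paths without commas, colons or parentheses.

```shell
#!/bin/sh
# overlay_layers MOUNT_LINE   (hypothetical helper)
# Print the layers of one overlay mount line, top to bottom: the upperdir
# first, then each lowerdir in the order it appears (leftmost is topmost).
# Assumption: paths contain no commas, colons or parentheses.
overlay_layers() {
    opts=${1##*\(}
    opts=${opts%\)*}
    upper=$(printf '%s\n' "$opts" | tr ',' '\n' | sed -n 's/^upperdir=//p')
    lower=$(printf '%s\n' "$opts" | tr ',' '\n' | sed -n 's/^lowerdir=//p')
    if [ -n "$upper" ]; then
        printf 'upper  %s\n' "$upper"
    fi
    if [ -n "$lower" ]; then
        printf '%s\n' "$lower" | tr ':' '\n' | sed 's/^/lower  /'
    fi
}
```

Feeding it each line of the output above, for example with a while read loop, should show that some containers sit on a single snapshot while others stack two or three.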

The mount output shows how the rootfs of a container running under containerd gets union-mounted:

  • containerd stores image files under /var/lib/containerd/io.containerd.content.v1.content:

    $ tree -L 2 /var/lib/containerd/io.containerd.content.v1.content/blobs/
    /var/lib/containerd/io.containerd.content.v1.content/blobs/
    └── sha256
        ├── 019d8da33d911d9baabe58ad63dea2107ed15115cca0fc27fc0f627e82a695c1
        ├── 052816d6a6844d1e04c19c4dd1f1b55b51fba98732d8ec4c8b92251d1739c704
        ├── 05c1a3be66823dcaca55ebe17c3c9a60de7ceb948047da3e95308348325ddd5a
        ├── 0c6b9ab3ebf9850e30ec8741d87cf101d97eebd3a934d0055850f119237ca1f2
        ├── 0dfc4f1512064e909fa8474ac08c49a5699546b03a7c3e87166d7b77eed640b0
        ├── 0f23e58bd0b7c74311703e20c21c690a6847e62240ed456f8821f4c067d3659b
        ├── 13bf18cc869803a1aedf81330d8ba4c3c3c10c175ba45a0dc866f347b31c7004
        ├── 1ff6c18fbef2045af6b9c16bf034cc421a29027b800e4f9b68ae9b1cb3e9ae07
    
  • In the Overlay-based snapshot filesystem, a snapshot is created and committed from each layer of the image and stored under /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs:

    $ tree -L 2 /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs
    /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs
    ├── metadata.db
    └── snapshots
        ├── 1
        ├── 10
        ├── 11
        ├── 12
        ├── 13
        ├── 14
        ├── 15
        ├── 16
        ├── 17
        ├── 2
        ├── 3
        ├── 32
        ├── 33
        ├── 35
        ├── 36
        ├── 37
        ├── 38
        ├── 39
        ├── 4
    

    These are the values given to the lowerdir option when Overlay does the union mount

  • An active snapshot has to be created on top of the image's last layer, and that snapshot is the container's rootfs:

    $ ctr -n k8s.io snapshot ls
    KEY                                                                     PARENT                                                                  KIND
    02072e6ad3505f704ce1842634241dbb4275ac0f7f1459096658029e239177ce        sha256:f07b5946e28c791718f26d42fa69f2e2b89df33b82ba073627819acdd08e1e9f Active
    0b6955314c4bf50246abf9c100901de87b6b4b20010e86e7d349b2ccb098f9f8        sha256:dee215ffc666313e1381d3e6e4299a4455503735b8df31c3fa161d2df50860a8 Active
    0d158b7fdd0066ac5a3eccc03463b8e19ad88b6d79314969fd5e187450cfb5b4        sha256:dee215ffc666313e1381d3e6e4299a4455503735b8df31c3fa161d2df50860a8 Active
    117d504a637f71222c6ab9214909ec83512401deeeec2633bf10c6fbe131f51d        sha256:19606512dfe192788a55d7c1efb9ec02041b4e318587632f755c5112f927e0e3 Active
    306ae926d5fab8f826fe1c01331b5ca40848bcedb5df6519bd5ccb962ff57281        sha256:dee215ffc666313e1381d3e6e4299a4455503735b8df31c3fa161d2df50860a8 Active
    33710f2eafa30783857f66aa00c73f411a7878968cb5468bd74f167790ecf558        sha256:dee215ffc666313e1381d3e6e4299a4455503735b8df31c3fa161d2df50860a8 Active
    $  mount -l | grep 33710f2eafa30783857f66aa00c73f411a7878968cb5468bd74f167790ecf558
    shm on /run/containerd/io.containerd.grpc.v1.cri/sandboxes/33710f2eafa30783857f66aa00c73f411a7878968cb5468bd74f167790ecf558/shm type tmpfs (rw,nosuid,nodev,noexec,relatime,size=65536k)
    overlay on /run/containerd/io.containerd.runtime.v2.task/k8s.io/33710f2eafa30783857f66aa00c73f411a7878968cb5468bd74f167790ecf558/rootfs type overlay (rw,relatime,lowerdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/17/fs,upperdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/598/fs,workdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/598/work)
    

What is a graph driver? https://blog.crazytaxii.com/posts/where_are_containerds_graph_drivers/


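Once the union mount is in place, the container runtime uses pivot_root or chroot to switch the root directory of the container process, so when we attach to the container with bash we see a complete rootfs.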