Why Kubernetes doesn't use libnetwork
Jan 20, 2022 22:00 · 3429 words · 7 minute read
Kubernetes has had a very basic form of network plugins since before version 1.0 was released, around the same time as Docker's libnetwork and the Container Network Model (CNM) were introduced. Unlike libnetwork, the Kubernetes plugin system still carries its "alpha" designation. Now that Docker's network plugin support is released and supported, an obvious question we get is why Kubernetes has not adopted it yet. After all, vendors will almost certainly be writing plugins for Docker; wouldn't we all be better off using the same drivers?
Before going further, it's important to remember that Kubernetes is a system that supports multiple container runtimes, of which Docker is just one. Configuring networking is a facet of each runtime, so when people ask "will Kubernetes support CNM?" what they really mean is "will Kubernetes support CNM drivers with the Docker runtime?" It would be great if we could achieve common network support across runtimes, but that is not an explicit goal.
Indeed, Kubernetes has not adopted CNM/libnetwork for the Docker runtime. In fact, we have been investigating the alternative Container Network Interface (CNI) model put forth by CoreOS and part of the App Container (appc) specification. Why? There are a number of reasons, both technical and non-technical.
First and foremost, there are some fundamental assumptions in the design of Docker's network drivers that cause problems for us.
Docker has a concept of "local" and "global" drivers. Local drivers (such as "bridge") are machine-centric and don't do any cross-node coordination. Global drivers (such as "overlay") rely on libkv, a key-value store abstraction, to coordinate across machines. This key-value store is another plugin interface, and a very low-level one (keys and values, no semantic meaning). To run something like Docker's overlay driver in a Kubernetes cluster, cluster admins would either have to run a whole separate instance of consul, etcd, or zookeeper, or we would have to provide our own libkv implementation backed by Kubernetes.
The latter sounds attractive, and we tried to implement it, but the libkv interface is very low-level and its schema is defined internally to Docker. We would have to either expose our underlying key-value store directly or offer key-value semantics on top of our structured API, which is itself implemented on a key-value system. Neither of those is attractive for performance, scalability, or security reasons. The net result is that the whole system would become significantly more complicated, when the goal of using Docker networking was to simplify things.
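To make the mismatch concrete, the sketch below shows the general shape of a key-value interface in the spirit of libkv. The type and method names here are illustrative rather than the exact libkv API; the point is that a global driver expects raw get/put/watch semantics over opaque keys, which maps poorly onto Kubernetes' structured, validated API objects.

```go
// Illustrative only: the general shape of a libkv-style store that a
// "global" Docker network driver expects to coordinate through. The
// names are a sketch, not the exact libkv interface.
package kvsketch

// KVPair is a raw key plus an opaque value; no schema or semantic
// meaning is attached to either.
type KVPair struct {
	Key       string
	Value     []byte
	LastIndex uint64 // revision used for compare-and-swap style updates
}

// Store is the low-level contract a global driver coordinates through.
// Backing this with Kubernetes would mean exposing raw key-value
// semantics on top of an API that is itself built on a key-value store.
type Store interface {
	Put(key string, value []byte) error
	Get(key string) (*KVPair, error)
	Delete(key string) error
	Exists(key string) (bool, error)
	// Watch streams updates for a key until stopCh is closed.
	Watch(key string, stopCh <-chan struct{}) (<-chan *KVPair, error)
}
```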
For users who are willing and able to run the requisite infrastructure to satisfy Docker's global drivers and to configure Docker themselves, Docker networking should "just work." Kubernetes will not get in the way of such a setup, and no matter what direction the project goes, that option should remain available. For default installations, though, the practical conclusion is that this is an undue burden on users, so we cannot use Docker's global drivers (including "overlay"), which eliminates much of the value of using Docker's plugins at all.
Docker's networking model also makes a number of assumptions that aren't valid for Kubernetes. In Docker versions 1.8 and 1.9, it includes a fundamentally flawed implementation of "discovery" that results in corrupted /etc/hosts files in containers (docker #17190), and this cannot easily be turned off. In version 1.10 Docker is planning to bundle a new DNS server, and it's unclear whether that will be able to be turned off either. Container-level naming is not the right abstraction for Kubernetes: we already have our own concepts of service naming, discovery, and binding, and we already have our own DNS schema and server (based on the well-established SkyDNS). The bundled solutions are not sufficient for our needs, yet they cannot be disabled.
Orthogonal to the local/global split, Docker has both in-process and out-of-process ("remote") plugins. We investigated whether we could bypass libnetwork (and thereby skip the issues above) and drive Docker's remote plugins directly. Unfortunately, this would mean we could not use any of Docker's in-process plugins, "bridge" and "overlay" in particular, which again eliminates much of the utility of libnetwork.
CNI, on the other hand, is more philosophically aligned with Kubernetes. It is far simpler than CNM, doesn't require daemons, and is at least plausibly cross-platform (CoreOS's rkt container runtime supports it). Being cross-platform means there is a chance to enable network configurations that work the same way across runtimes (e.g. Docker, Rocket, Hyper). It follows the UNIX philosophy of doing one thing well.
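For a sense of how lightweight that contract is, the sketch below shows the rough shape of a CNI-style plugin: a plain executable invoked once per operation, told what to do via environment variables, handed its network configuration as JSON on stdin, and expected to print a JSON result on stdout. The variable and field names follow CNI conventions (CNI_COMMAND, CNI_NETNS, and so on), but treat the details, especially the result format, as a simplified sketch rather than a complete implementation of the spec.

```go
// A minimal sketch of the CNI plugin contract: no daemon, just an
// executable that the runtime invokes per operation. Illustrative only.
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// NetConf mirrors the core fields of a CNI network configuration.
type NetConf struct {
	CNIVersion string `json:"cniVersion"`
	Name       string `json:"name"`
	Type       string `json:"type"`
}

func main() {
	// The runtime tells the plugin what to do via environment variables.
	cmd := os.Getenv("CNI_COMMAND") // e.g. ADD or DEL
	netns := os.Getenv("CNI_NETNS") // path to the container's network namespace

	// The network configuration arrives as JSON on stdin.
	var conf NetConf
	if err := json.NewDecoder(os.Stdin).Decode(&conf); err != nil {
		fmt.Fprintln(os.Stderr, "invalid network config:", err)
		os.Exit(1)
	}

	switch cmd {
	case "ADD":
		// A real plugin would create and configure an interface inside
		// the namespace at netns, then print a JSON result describing the
		// assigned addresses. The result below is only a placeholder.
		_ = netns
		fmt.Printf(`{"cniVersion": %q, "ip4": {"ip": "10.1.0.2/24"}}`+"\n", conf.CNIVersion)
	case "DEL":
		// A real plugin would tear down whatever ADD created.
	default:
		fmt.Fprintln(os.Stderr, "unsupported CNI_COMMAND:", cmd)
		os.Exit(1)
	}
}
```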
Additionally, it is trivial to wrap a CNI plugin to produce a more customized CNI plugin; it can be done with a simple shell script. CNM is much more complex in this regard. That makes CNI an attractive option for rapid development and iteration. Early prototypes have shown that it is possible to move almost all of the network logic currently hard-coded in the kubelet out into a plugin.
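As an illustration of how cheap that wrapping is, here is a hypothetical wrapper plugin (written in Go rather than shell) that reads the incoming network config, rewrites one field, and then hands everything, environment included, to a delegate plugin. The delegate path and the rewritten field are illustrative assumptions, not part of any real plugin.

```go
// A hypothetical "wrapper" CNI plugin: it adjusts the incoming config
// and then delegates to another CNI plugin binary, passing the CNI_*
// environment variables through untouched.
package main

import (
	"bytes"
	"encoding/json"
	"os"
	"os/exec"
)

func main() {
	// Read the network config the runtime passed on stdin.
	var conf map[string]interface{}
	if err := json.NewDecoder(os.Stdin).Decode(&conf); err != nil {
		os.Exit(1)
	}

	// Customize it; here we force a particular MTU before delegating.
	conf["mtu"] = 1460

	modified, err := json.Marshal(conf)
	if err != nil {
		os.Exit(1)
	}

	// Invoke the real plugin with the same environment and the edited config.
	delegate := exec.Command("/opt/cni/bin/bridge") // hypothetical delegate path
	delegate.Env = os.Environ()
	delegate.Stdin = bytes.NewReader(modified)
	delegate.Stdout = os.Stdout
	delegate.Stderr = os.Stderr
	if err := delegate.Run(); err != nil {
		os.Exit(1)
	}
}
```

In shell this reduces to a few lines that edit the JSON and exec the delegate, which is exactly why CNI lends itself to quick iteration.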
We investigated writing a "bridge" CNM driver for Docker that would in turn run CNI drivers. This turned out to be very complicated. First, the CNM and CNI models are very different, so none of the "methods" lined up. We would still have the global vs. local and key-value issues discussed above. And assuming this driver declared itself local, we would still have to get information about logical networks from Kubernetes.
Unfortunately, Docker drivers are hard to map to another control plane like Kubernetes. Specifically, drivers are not told the name of the network to which a container is being attached, only an ID that Docker allocates internally. This makes it hard for a driver to map back to any concept of a network that exists in another system.
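Roughly speaking, the create-network call a remote driver receives looks like the sketch below: an opaque, Docker-allocated NetworkID plus IPAM data, with no user-visible network name to correlate against objects in an external system. The field names are a simplification of the libnetwork remote driver payload, not an exact copy of its wire format.

```go
// A simplified sketch of what a libnetwork remote driver sees when a
// network is created: an opaque ID, not the user-facing name. Field
// names are approximate, not the exact wire format.
package cnmsketch

// CreateNetworkRequest is roughly what arrives at the driver's
// create-network endpoint.
type CreateNetworkRequest struct {
	NetworkID string                 // Docker-internal identifier, e.g. "3fb0...", not "my-logical-net"
	Options   map[string]interface{} // driver-specific options
	IPv4Data  []IPAMData             // address pools chosen by Docker's IPAM
}

// IPAMData describes one address pool for the network.
type IPAMData struct {
	Pool    string // e.g. "10.20.0.0/16"
	Gateway string // e.g. "10.20.0.1"
}
```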
These and other issues have been raised with Docker's developers by network vendors, and are usually closed as "working as intended" (libnetwork #139, libnetwork #486, libnetwork #514, libnetwork #865, docker #18864), even though they make it harder for non-Docker third-party systems to integrate. Throughout this investigation, Docker has made it clear that they are not very open to ideas that deviate from their current course or that delegate control. This is very worrisome to us, since Kubernetes complements Docker and adds so much functionality, but exists outside of Docker itself.
For all of these reasons, we have chosen to invest in CNI as the Kubernetes plugin model. There will be some unfortunate side effects of this. Most of them are relatively minor (for example, docker inspect will not show an IP address), but some are significant. In particular, containers started by docker run might not be able to communicate with containers started by Kubernetes, and network integrators will have to provide CNI drivers if they want to integrate fully with Kubernetes. On the other hand, Kubernetes will get simpler and more flexible, and a lot of the ugliness of early bootstrapping (such as configuring Docker to use our bridge) will go away.
As we proceed down this path, we will certainly keep our eyes and ears open for better ways to integrate and simplify. If you have ideas on how we can do that, we would really like to hear them; find us on slack or on our network SIG mailing list.