Troubleshooting a Docker Container DNAT Failure
Jan 15, 2021 01:15 · 3180 words · 7 minute read
The symptom in one sentence: machines on the same layer-2 network cannot reach a service running as a container on the host, and the TCP handshake never completes. curl times out, and telnet's TCP connection attempts fail.
Since the root cause has already been found, the exact same failure is easy to reproduce, so the full troubleshooting process is replayed here step by step.
This post uses nginx as the example; the host's IP is 10.211.55.15.
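For context, the container was presumably started in the usual way with port 80 published to the host — a sketch, with the exact command inferred from the docker ps output shown below:

$ docker run -d --name nginx -p 80:80 nginx:latest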
1. Rule out a problem with the service itself
- Confirm that port 80 is being listened on:

$ ss -ltn | grep 80
LISTEN     0      128          *:80                 *:*
- Confirm that the nginx container is running:

$ docker ps
CONTAINER ID   IMAGE          COMMAND                  CREATED          STATUS          PORTS                NAMES
2e8f73e37c27   nginx:latest   "/docker-entrypoint.…"   24 minutes ago   Up 24 minutes   0.0.0.0:80->80/tcp   nginx
- Confirm that the service itself responds:

$ curl http://127.0.0.1:80/
<!DOCTYPE html>
# ...
</html>
$ curl http://10.211.55.15:80/    # on the host
<!DOCTYPE html>
# ...
</html>
The application inside the container is perfectly healthy, so the problem must be in the network.
2. Confirm whether the TCP packets reach the container's network namespace
The most direct and convincing way to confirm a network problem is to capture packets. We can work either from the inside out or from the outside in; I chose inside out, using the same technique as capturing inside a Kubernetes pod's network namespace:
- First find the PID of the containerized application:

$ docker inspect -f {{.State.Pid}} nginx
25248
- ps works too; blunt, but it gets you to the same place:

$ ps -ef | grep nginx
root     25248 25225  0 09:19 ?        00:00:00 nginx: master process nginx -g daemon off;
101      25299 25248  0 09:19 ?        00:00:00 nginx: worker process
root     27389  9506  0 09:59 pts/0    00:00:00 grep --color=auto nginx
- Enter the container's network namespace with nsenter (a single-command variant is sketched at the end of this section):

$ nsenter -n -t 25248
$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
4: eth0@if5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
    link/ether 02:42:ac:11:00:02 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 172.17.0.2/16 brd 172.17.255.255 scope global eth0
       valid_lft forever preferred_lft forever
- Capture on eth0, i.e. the veth device wired into the container, while running curl http://10.211.55.15:80/ from a machine outside the host:

$ tcpdump -i eth0 tcp port 80
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
No output at all. Running curl http://10.211.55.15:80/ on the host itself, however:

$ tcpdump -i eth0 tcp port 80
10:11:56.008758 IP 10.211.55.15.56108 > iperror.http: Flags [S], seq 3567635538, win 43690, options [mss 65495,sackOK,TS val 3304992 ecr 0,nop,wscale 7], length 0
10:11:56.008803 IP iperror.http > 10.211.55.15.56108: Flags [S.], seq 4217367594, ack 3567635539, win 28960, options [mss 1460,sackOK,TS val 3304992 ecr 3304992,nop,wscale 7], length 0
10:11:56.008822 IP 10.211.55.15.56108 > iperror.http: Flags [.], ack 1, win 342, options [nop,nop,TS val 3304992 ecr 3304992], length 0
10:11:56.008883 IP 10.211.55.15.56108 > iperror.http: Flags [P.], seq 1:77, ack 1, win 342, options [nop,nop,TS val 3304992 ecr 3304992], length 76: HTTP: GET / HTTP/1.1
10:11:56.008891 IP iperror.http > 10.211.55.15.56108: Flags [.], ack 77, win 227, options [nop,nop,TS val 3304992 ecr 3304992], length 0
10:11:56.009032 IP iperror.http > 10.211.55.15.56108: Flags [P.], seq 1:239, ack 77, win 227, options [nop,nop,TS val 3304992 ecr 3304992], length 238: HTTP: HTTP/1.1 200 OK
10:11:56.009048 IP 10.211.55.15.56108 > iperror.http: Flags [.], ack 239, win 350, options [nop,nop,TS val 3304992 ecr 3304992], length 0
10:11:56.009065 IP iperror.http > 10.211.55.15.56108: Flags [P.], seq 239:851, ack 77, win 227, options [nop,nop,TS val 3304992 ecr 3304992], length 612: HTTP
10:11:56.009073 IP 10.211.55.15.56108 > iperror.http: Flags [.], ack 851, win 360, options [nop,nop,TS val 3304992 ecr 3304992], length 0
10:11:56.009287 IP 10.211.55.15.56108 > iperror.http: Flags [F.], seq 77, ack 851, win 360, options [nop,nop,TS val 3304992 ecr 3304992], length 0
10:11:56.009793 IP iperror.http > 10.211.55.15.56108: Flags [F.], seq 851, ack 78, win 227, options [nop,nop,TS val 3304993 ecr 3304992], length 0
10:11:56.009808 IP 10.211.55.15.56108 > iperror.http: Flags [.], ack 852, win 360, options [nop,nop,TS val 3304993 ecr 3304993], length 0
This tells us that TCP packets originating outside the host never reach the container's veth device.
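As an aside, the PID lookup and the capture can be combined so that no interactive shell needs to stay open inside the namespace — a small sketch built only from the commands already used above:

$ nsenter -n -t "$(docker inspect -f '{{.State.Pid}}' nginx)" tcpdump -i eth0 tcp port 80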
3. Confirm whether the TCP packets reach the docker0 bridge
Working from the inside out, the next capture point is the other end of the veth pair: the docker0 bridge.
- Exit the nginx container's network namespace with exit or Ctrl-D.
- Capture on the docker0 bridge, again while running curl http://10.211.55.15:80/ from outside the host:

$ tcpdump -i docker0 tcp and port 80
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on docker0, link-type EN10MB (Ethernet), capture size 262144 bytes
Again no output at all. Running curl http://10.211.55.15:80/ on the host itself:

$ tcpdump -i docker0 tcp and port 80
10:19:22.181690 IP iperror.shared.56110 > 172.17.0.2.http: Flags [S], seq 2393925363, win 43690, options [mss 65495,sackOK,TS val 3751165 ecr 0,nop,wscale 7], length 0
10:19:22.181744 IP 172.17.0.2.http > iperror.shared.56110: Flags [S.], seq 2018401623, ack 2393925364, win 28960, options [mss 1460,sackOK,TS val 3751165 ecr 3751165,nop,wscale 7], length 0
10:19:22.181764 IP iperror.shared.56110 > 172.17.0.2.http: Flags [.], ack 1, win 342, options [nop,nop,TS val 3751165 ecr 3751165], length 0
10:19:22.181832 IP iperror.shared.56110 > 172.17.0.2.http: Flags [P.], seq 1:77, ack 1, win 342, options [nop,nop,TS val 3751165 ecr 3751165], length 76: HTTP: GET / HTTP/1.1
10:19:22.181839 IP 172.17.0.2.http > iperror.shared.56110: Flags [.], ack 77, win 227, options [nop,nop,TS val 3751165 ecr 3751165], length 0
10:19:22.182180 IP 172.17.0.2.http > iperror.shared.56110: Flags [P.], seq 1:239, ack 77, win 227, options [nop,nop,TS val 3751165 ecr 3751165], length 238: HTTP: HTTP/1.1 200 OK
10:19:22.182192 IP iperror.shared.56110 > 172.17.0.2.http: Flags [.], ack 239, win 350, options [nop,nop,TS val 3751165 ecr 3751165], length 0
10:19:22.182211 IP 172.17.0.2.http > iperror.shared.56110: Flags [P.], seq 239:851, ack 77, win 227, options [nop,nop,TS val 3751165 ecr 3751165], length 612: HTTP
10:19:22.182219 IP iperror.shared.56110 > 172.17.0.2.http: Flags [.], ack 851, win 360, options [nop,nop,TS val 3751165 ecr 3751165], length 0
10:19:22.182380 IP iperror.shared.56110 > 172.17.0.2.http: Flags [F.], seq 77, ack 851, win 360, options [nop,nop,TS val 3751165 ecr 3751165], length 0
10:19:22.182942 IP 172.17.0.2.http > iperror.shared.56110: Flags [F.], seq 851, ack 78, win 227, options [nop,nop,TS val 3751166 ecr 3751165], length 0
10:19:22.182957 IP iperror.shared.56110 > 172.17.0.2.http: Flags [.], ack 852, win 360, options [nop,nop,TS val 3751166 ecr 3751166], length 0
Same result as before: TCP packets coming from outside the host never reach the docker0 bridge either.
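Strictly speaking, there is one more capture point between the two used above: the host-side peer of the container's veth pair. The container's interface showed up as eth0@if5, so its peer is the host interface with ifindex 5 — on this host that is the vethdb35318 device attached to docker0 (the name appears again in the sysctl output further down). Capturing there is equivalent to capturing on the container side of the bridge; a sketch:

$ ip link | grep '^5:'                       # host-side peer of the container's eth0@if5
$ tcpdump -i vethdb35318 tcp and port 80     # interface name taken from the output above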
4. Confirm whether the TCP packets reach the host
That leaves one last place to capture: the host's own NIC.
- Capture on the host's eth0 NIC, once again while running curl http://10.211.55.15:80/ from outside the host:

$ tcpdump -i eth0 tcp and port 80
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
10:24:18.281167 IP 10.211.55.2.61471 > iperror.shared.http: Flags [SEW], seq 4259795053, win 65535, options [mss 1460,nop,wscale 6,nop,nop,TS val 182086346 ecr 0,sackOK,eol], length 0
10:24:19.305082 IP 10.211.55.2.61471 > iperror.shared.http: Flags [S], seq 4259795053, win 65535, options [mss 1460,nop,wscale 6,nop,nop,TS val 182087346 ecr 0,sackOK,eol], length 0
10:24:20.339621 IP 10.211.55.2.61471 > iperror.shared.http: Flags [S], seq 4259795053, win 65535, options [mss 1460,nop,wscale 6,nop,nop,TS val 182088346 ecr 0,sackOK,eol], length 0
10:24:21.363671 IP 10.211.55.2.61471 > iperror.shared.http: Flags [S], seq 4259795053, win 65535, options [mss 1460,nop,wscale 6,nop,nop,TS val 182089346 ecr 0,sackOK,eol], length 0
10:24:22.396893 IP 10.211.55.2.61471 > iperror.shared.http: Flags [S], seq 4259795053, win 65535, options [mss 1460,nop,wscale 6,nop,nop,TS val 182090346 ecr 0,sackOK,eol], length 0
10:24:23.417779 IP 10.211.55.2.61471 > iperror.shared.http: Flags [S], seq 4259795053, win 65535, options [mss 1460,nop,wscale 6,nop,nop,TS val 182091346 ecr 0,sackOK,eol], length 0
10:24:25.471857 IP 10.211.55.2.61471 > iperror.shared.http: Flags [S], seq 4259795053, win 65535, options [mss 1460,nop,wscale 6,nop,nop,TS val 182093346 ecr 0,sackOK,eol], length 0
This time the result is different: the TCP packets from outside the host do reach the eth0 NIC, so layer-2/layer-3 connectivity is not the problem.
Had the packets from outside the host never reached the eth0 NIC, the next things to check would have been layer-2/layer-3 reachability and security groups.
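In that situation, a couple of quick checks usually narrow it down — a sketch, assuming the client's NIC is also named eth0 and iputils arping is installed there:

$ ping 10.211.55.15              # from the client: basic L3 reachability
$ arping -I eth0 10.211.55.15    # from the client: does ARP resolve across L2?
$ ip neigh show                  # on the host: has the client's MAC been learned?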
We can also capture another round and write it to a pcap file, to examine it more comfortably in Wireshark:

$ tcpdump -i eth0 -s 65535 -w debug.pcap tcp and port 80
The first SYN of the handshake goes in and nothing ever comes back, which means the packet meets its end somewhere inside the kernel.
5. Netfilter
Between the eth0 NIC and the docker0 bridge lies a long stretch of the Linux kernel stack, and Netfilter is the most likely place for a packet to be dropped.
Netfilter is the Linux kernel framework that provides the ability to modify packets (e.g. NAT) and filter them (e.g. firewalling).
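Before going table by table, it is worth noting that the raw table's TRACE target can log a packet's entire path through the iptables chains — a sketch, assuming the nf_log modules are available; the trace lines typically land in the kernel log:

$ modprobe nf_log_ipv4                                        # may be required on some kernels
$ iptables -t raw -A PREROUTING -p tcp --dport 80 -j TRACE
$ dmesg | grep 'TRACE:'                                       # reproduce the failing curl first
$ iptables -t raw -D PREROUTING -p tcp --dport 80 -j TRACE    # remove the rule when done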
Following the classic Netfilter packet-flow diagram, we check the hook points of the four IP-layer tables, raw -> mangle -> nat -> filter (in descending priority), one by one:
- The raw table

Rarely used and an unlikely suspect, but worth a glance anyway.

$ iptables -t raw -nL
Chain PREROUTING (policy ACCEPT)
target     prot opt source               destination

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination

Nothing is dropped at the first hook point from the left, the raw table's PREROUTING chain.
- The mangle table

Used mainly to modify packets.

$ iptables -t mangle -nL
Chain PREROUTING (policy ACCEPT)
target     prot opt source               destination

Chain INPUT (policy ACCEPT)
target     prot opt source               destination

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination

Chain POSTROUTING (policy ACCEPT)
target     prot opt source               destination

The second hook point, the mangle table's PREROUTING chain, likewise drops nothing.
- The nat table

This is where DNAT, the rewriting of destination IP addresses, is implemented.

$ iptables -t nat -nL
Chain PREROUTING (policy ACCEPT)
target     prot opt source               destination
DOCKER     all  --  0.0.0.0/0            0.0.0.0/0            ADDRTYPE match dst-type LOCAL

Chain INPUT (policy ACCEPT)
target     prot opt source               destination

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination
DOCKER     all  --  0.0.0.0/0           !127.0.0.0/8          ADDRTYPE match dst-type LOCAL

Chain POSTROUTING (policy ACCEPT)
target     prot opt source               destination
MASQUERADE  all  --  172.17.0.0/16       0.0.0.0/0
MASQUERADE  tcp  --  172.17.0.2          172.17.0.2           tcp dpt:80

Chain DOCKER (2 references)
target     prot opt source               destination
RETURN     all  --  0.0.0.0/0            0.0.0.0/0
DNAT       tcp  --  0.0.0.0/0            0.0.0.0/0            tcp dpt:80 to:172.17.0.2:80

The PREROUTING -> DOCKER chain shows exactly how Docker uses DNAT to translate host IP + port into container IP + port, and nothing looks wrong with the rules themselves. Based on everything observed so far, one might suspect that the packets never even reach the nat table's PREROUTING chain, because once a packet makes it past this point its destination IP is rewritten to 172.17.0.2 and it heads straight for the docker0 bridge.
- The filter table

The last gate before the application is the filter table; if the packets are not making it past nat, they will never get this far, but let's look anyway.

$ iptables -t filter -nL
Chain INPUT (policy ACCEPT)
target     prot opt source               destination

Chain FORWARD (policy DROP)
target     prot opt source               destination
DOCKER-USER  all  --  0.0.0.0/0          0.0.0.0/0
DOCKER-ISOLATION-STAGE-1  all  --  0.0.0.0/0  0.0.0.0/0
ACCEPT     all  --  0.0.0.0/0            0.0.0.0/0            ctstate RELATED,ESTABLISHED
DOCKER     all  --  0.0.0.0/0            0.0.0.0/0
ACCEPT     all  --  0.0.0.0/0            0.0.0.0/0
ACCEPT     all  --  0.0.0.0/0            0.0.0.0/0

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination

Chain DOCKER (1 references)
target     prot opt source               destination
ACCEPT     tcp  --  0.0.0.0/0            172.17.0.2           tcp dpt:80

Chain DOCKER-ISOLATION-STAGE-1 (1 references)
target     prot opt source               destination
DOCKER-ISOLATION-STAGE-2  all  --  0.0.0.0/0  0.0.0.0/0
RETURN     all  --  0.0.0.0/0            0.0.0.0/0

Chain DOCKER-ISOLATION-STAGE-2 (1 references)
target     prot opt source               destination
DROP       all  --  0.0.0.0/0            0.0.0.0/0
RETURN     all  --  0.0.0.0/0            0.0.0.0/0

Chain DOCKER-USER (1 references)
target     prot opt source               destination
RETURN     all  --  0.0.0.0/0            0.0.0.0/0
Running iptables with the -v option shows how many packets have passed through each chain and rule:
$ iptables -t nat -nL -v
Chain PREROUTING (policy ACCEPT 330 packets, 83663 bytes)
pkts bytes target prot opt in out source destination
179 11376 DOCKER all -- * * 0.0.0.0/0 0.0.0.0/0 ADDRTYPE match dst-type LOCAL
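To see whether a particular burst of traffic actually moves these counters, it helps to watch them live, optionally zeroing them first — a small sketch:

$ iptables -t nat -Z PREROUTING              # reset the counters (optional)
$ watch -n 1 iptables -t nat -nvL PREROUTING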
To get an unmistakable signal, we point a load-generation tool at the host and then read the counters again:
$ siege http://10.211.55.15:80/
[alert] socket: select and discovered it's not ready sock.c:384: Operation timed out
[alert] socket: read check timed out(30) sock.c:273: Operation timed out
[alert] socket: select and discovered it's not ready sock.c:384: Operation timed out
[alert] socket: read check timed out(30) sock.c:273: Operation timed out
[alert] socket: select and discovered it's not ready sock.c:384: Operation timed out
[alert] socket: select and discovered it's not ready sock.c:384: Operation timed out
[alert] socket: select and discovered it's not ready sock.c:384: Operation timed out
[alert] socket: select and discovered it's not ready sock.c:384: Operation timed out
[alert] socket: select and discovered it's not ready sock.c:384: Operation timed out
[alert] socket: read check timed out(30) sock.c:273: Operation timed out
[alert] socket: select and discovered it's not ready sock.c:384: Operation timed out
[alert] socket: read check timed out(30) sock.c:273: Operation timed out
[alert] socket: select and discovered it's not ready sock.c:384: Operation timed out
[alert] socket: read check timed out(30) sock.c:273: Operation timed out
[alert] socket: select and discovered it's not ready sock.c:384: Operation timed out
[alert] socket: read check timed out(30) sock.c:273: Operation timed out
$ iptables -t nat -nL -v
Chain PREROUTING (policy ACCEPT 334 packets, 84455 bytes)
pkts bytes target prot opt in out source destination
608 38832 DOCKER all -- * * 0.0.0.0/0 0.0.0.0/0 ADDRTYPE match dst-type LOCAL
The DOCKER rule's packet counter jumps from 179 to 608, so the incoming connections are in fact reaching the nat table's PREROUTING chain and being DNAT'd; the packets must be getting lost somewhere after that, deeper in the kernel.
6. Packet drop analysis
dropwatch is a tool dedicated to observing packet drops inside the Linux kernel; we use it to find the real drop point (keeping the load-generation tool running at the same time makes the pattern stand out):
$ dropwatch -lkas
Initalizing kallsyms db
dropwatch> start
Enabling monitoring...
Kernel monitoring activated.
Issue Ctrl-C to stop monitoring
1 drops at skb_queue_purge+18 (0xffffffff9bc3c808)
25 drops at ip_error+68 (0xffffffff9bc988f0)
8 drops at netlink_broadcast_filtered+2b9 (0xffffffff9bc8efc9)
1 drops at netlink_unicast+1fa (0xffffffff9bc90d6a)
25 drops at ip_error+68 (0xffffffff9bc988f0)
25 drops at ip_error+68 (0xffffffff9bc988f0)
25 drops at ip_error+68 (0xffffffff9bc988f0)
1 drops at __udp4_lib_rcv+bb (0xffffffff9bcd137b)
25 drops at ip_error+68 (0xffffffff9bc988f0)
25 drops at ip_error+68 (0xffffffff9bc988f0)
25 drops at ip_error+68 (0xffffffff9bc988f0)
25 drops at ip_error+68 (0xffffffff9bc988f0)
25 drops at ip_error+68 (0xffffffff9bc988f0)
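If dropwatch happens not to be available, a similar view can be had from the kernel's kfree_skb path — a sketch, assuming bpftrace (or perf) is installed; on newer kernels the function may be named kfree_skb_reason instead:

$ bpftrace -e 'kprobe:kfree_skb { @[kstack] = count(); }'          # count drops per kernel stack
$ perf record -e skb:kfree_skb -a -g -- sleep 10 && perf script    # tracepoint-based alternative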
Either way, the drops localize to ip_error, and that is enough to search on. The packets are being dropped because IP forwarding is disabled on the host: the kernel only accepts packets destined for one of its own IP addresses and discards the rest — including our packets, whose destination has just been DNAT'd to 172.17.0.2.
$ sysctl -a | grep "\.forwarding" | grep ipv4
sysctl: reading key "net.ipv6.conf.all.stable_secret"
sysctl: reading key "net.ipv6.conf.default.stable_secret"
sysctl: reading key "net.ipv6.conf.docker0.stable_secret"
sysctl: reading key "net.ipv6.conf.eth0.stable_secret"
sysctl: reading key "net.ipv6.conf.lo.stable_secret"
sysctl: reading key "net.ipv6.conf.vethdb35318.stable_secret"
net.ipv4.conf.all.forwarding = 0
net.ipv4.conf.default.forwarding = 0
net.ipv4.conf.docker0.forwarding = 0
net.ipv4.conf.eth0.forwarding = 0
net.ipv4.conf.lo.forwarding = 0
net.ipv4.conf.vethdb35318.forwarding = 0
According to the Docker documentation (http://docs.docker.oeynet.com/engine/userguide/networking/default_network/container-communication/), this kernel parameter needs to be set to 1:
$ sysctl net.ipv4.conf.all.forwarding=1
net.ipv4.conf.all.forwarding = 1
$ sysctl -a | grep "\.forwarding" | grep ipv4
sysctl: reading key "net.ipv6.conf.all.stable_secret"
sysctl: reading key "net.ipv6.conf.default.stable_secret"
sysctl: reading key "net.ipv6.conf.docker0.stable_secret"
sysctl: reading key "net.ipv6.conf.eth0.stable_secret"
sysctl: reading key "net.ipv6.conf.lo.stable_secret"
sysctl: reading key "net.ipv6.conf.vethdb35318.stable_secret"
net.ipv4.conf.all.forwarding = 1
net.ipv4.conf.default.forwarding = 1
net.ipv4.conf.docker0.forwarding = 1
net.ipv4.conf.eth0.forwarding = 1
net.ipv4.conf.lo.forwarding = 1
net.ipv4.conf.vethdb35318.forwarding = 1
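To keep the setting across reboots, it can be persisted via sysctl configuration — a sketch; the file name is arbitrary, and the more common net.ipv4.ip_forward key has the same effect. Note that the Docker daemon normally enables IP forwarding itself when it manages iptables, so a host where it nevertheless ends up at 0 is worth investigating for whatever reset it.

$ echo 'net.ipv4.conf.all.forwarding = 1' > /etc/sysctl.d/99-ip-forward.conf
$ sysctl --system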
With forwarding enabled, packets make it through Netfilter to the docker0 bridge without further obstruction:
$ curl http://10.211.55.15:80/
<!DOCTYPE html>
<html>
<head>
<title>Welcome to nginx!</title>
<style>
body {
width: 35em;
margin: 0 auto;
font-family: Tahoma, Verdana, Arial, sans-serif;
}
</style>
</head>
<body>
<h1>Welcome to nginx!</h1>
<p>If you see this page, the nginx web server is successfully installed and
working. Further configuration is required.</p>
<p>For online documentation and support please refer to
<a href="http://nginx.org/">nginx.org</a>.<br/>
Commercial support is available at
<a href="http://nginx.com/">nginx.com</a>.</p>
<p><em>Thank you for using nginx.</em></p>
</body>
</html>
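As a final check, the captures from the earlier steps can be repeated: with forwarding on, the same curl from outside the host should now show the full three-way handshake on docker0 and on the container's eth0 as well, for example:

$ tcpdump -i docker0 tcp and port 80    # run the external curl again in parallel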