Troubleshooting Docker Container DNAT Failures

The symptom, in one sentence: machines on the same layer-2 network cannot reach a service running as a container on the host; the TCP handshake fails.

  • curl times out
  • telnet attempts to open a TCP connection fail
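
For instance, hitting the service from another machine on the same subnet looks like this (illustrative output; the 5-second cap is arbitrary):

$ curl -m 5 http://10.211.55.15:80/
curl: (28) Connection timed out after 5001 milliseconds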

Since the root cause has since been identified, the exact same failure is easy to reproduce, so what follows is a faithful, step-by-step replay of the entire troubleshooting process.

This post uses nginx as the example service; the host's IP is 10.211.55.15.

1. Rule out problems with the service itself

  1. Confirm that something is listening on port 80:

    $ ss -ltn | grep 80
    LISTEN     0      128          *:80                       *:*
    
  2. Confirm the nginx container is running:

    $ docker ps
    CONTAINER ID   IMAGE          COMMAND                  CREATED          STATUS          PORTS                NAMES
    2e8f73e37c27   nginx:latest   "/docker-entrypoint.…"   24 minutes ago   Up 24 minutes   0.0.0.0:80->80/tcp   nginx
    
  3. Confirm the service itself responds normally:

    $ curl http://127.0.0.1:80/
    <!DOCTYPE html>
    # ...
    </html>
    
    $ curl http://10.211.55.15:80/ # on the host
    <!DOCTYPE html>
    # ...
    </html>
    

The containerized application itself is perfectly fine, so the problem must lie somewhere in the network.

2. Check whether TCP packets reach the container's network namespace

The most direct and convincing way to confirm a network problem is to capture packets. We can capture either from the inside out or from the outside in; I chose inside out, using the same technique as capturing inside a Kubernetes pod's network namespace:

  1. First, find the PID of the containerized application:

    $ docker inspect -f {{.State.Pid}} nginx
    25248
    
  2. ps works just as well; crude but effective, and it gets us to the same place:

    $ ps -ef | grep nginx
    root     25248 25225  0 09:19 ?        00:00:00 nginx: master process nginx -g daemon off;
    101      25299 25248  0 09:19 ?        00:00:00 nginx: worker process
    root     27389  9506  0 09:59 pts/0    00:00:00 grep --color=auto nginx
    
  3. Enter the container's network namespace with nsenter:

    $ nsenter -n -t 25248
    $ ip a
    1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
        link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
        inet 127.0.0.1/8 scope host lo
        valid_lft forever preferred_lft forever
    4: eth0@if5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
        link/ether 02:42:ac:11:00:02 brd ff:ff:ff:ff:ff:ff link-netnsid 0
        inet 172.17.0.2/16 brd 172.17.255.255 scope global eth0
        valid_lft forever preferred_lft forever
    
  4. Capture on the eth0 virtual NIC, i.e. the container end of the veth pair Docker set up for it:

    Meanwhile, from outside the host, run curl http://10.211.55.15:80/

    $ tcpdump -i eth0 tcp port 80
    tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
    listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
    

    No output at all. Running curl http://10.211.55.15:80/ on the host itself, however:

    $ tcpdump -i eth0 tcp port 80
    10:11:56.008758 IP 10.211.55.15.56108 > iperror.http: Flags [S], seq 3567635538, win 43690, options [mss 65495,sackOK,TS val 3304992 ecr 0,nop,wscale 7], length 0
    10:11:56.008803 IP iperror.http > 10.211.55.15.56108: Flags [S.], seq 4217367594, ack 3567635539, win 28960, options [mss 1460,sackOK,TS val 3304992 ecr 3304992,nop,wscale 7], length 0
    10:11:56.008822 IP 10.211.55.15.56108 > iperror.http: Flags [.], ack 1, win 342, options [nop,nop,TS val 3304992 ecr 3304992], length 0
    10:11:56.008883 IP 10.211.55.15.56108 > iperror.http: Flags [P.], seq 1:77, ack 1, win 342, options [nop,nop,TS val 3304992 ecr 3304992], length 76: HTTP: GET / HTTP/1.1
    10:11:56.008891 IP iperror.http > 10.211.55.15.56108: Flags [.], ack 77, win 227, options [nop,nop,TS val 3304992 ecr 3304992], length 0
    10:11:56.009032 IP iperror.http > 10.211.55.15.56108: Flags [P.], seq 1:239, ack 77, win 227, options [nop,nop,TS val 3304992 ecr 3304992], length 238: HTTP: HTTP/1.1 200 OK
    10:11:56.009048 IP 10.211.55.15.56108 > iperror.http: Flags [.], ack 239, win 350, options [nop,nop,TS val 3304992 ecr 3304992], length 0
    10:11:56.009065 IP iperror.http > 10.211.55.15.56108: Flags [P.], seq 239:851, ack 77, win 227, options [nop,nop,TS val 3304992 ecr 3304992], length 612: HTTP
    10:11:56.009073 IP 10.211.55.15.56108 > iperror.http: Flags [.], ack 851, win 360, options [nop,nop,TS val 3304992 ecr 3304992], length 0
    10:11:56.009287 IP 10.211.55.15.56108 > iperror.http: Flags [F.], seq 77, ack 851, win 360, options [nop,nop,TS val 3304992 ecr 3304992], length 0
    10:11:56.009793 IP iperror.http > 10.211.55.15.56108: Flags [F.], seq 851, ack 78, win 227, options [nop,nop,TS val 3304993 ecr 3304992], length 0
    10:11:56.009808 IP 10.211.55.15.56108 > iperror.http: Flags [.], ack 852, win 360, options [nop,nop,TS val 3304993 ecr 3304993], length 0
    

This shows that TCP packets coming from outside the host never reach the container's veth device. (Incidentally, the PID lookup and the namespace entry above can be combined into one command; see the sketch below.)
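
A minimal sketch of that shortcut, assuming the container is named nginx: nsenter accepts a command to execute inside the target namespace, so the lookup and the capture collapse into a single line:

$ nsenter -n -t "$(docker inspect -f '{{.State.Pid}}' nginx)" tcpdump -i eth0 tcp port 80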

3. Check whether TCP packets reach the docker0 bridge

Capturing from the inside out, the next point is the other end of the veth pair: the docker0 bridge:
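
One aside that helps here: in the ip a output above, the container's NIC shows as eth0@if5, which means its veth peer is interface index 5 on the host. A quick way to find the host-side device (a sketch; the interface name shown is the one that appears in the sysctl output near the end of this post):

$ ip -o link | grep '^5:'
5: vethdb35318@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 ...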

  1. Exit the nginx container's network namespace with exit or Ctrl-D

  2. Capture on the docker0 bridge:

    Meanwhile, from outside the host, run curl http://10.211.55.15:80/

    $ tcpdump -i docker0 tcp and port 80
    tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
    listening on docker0, link-type EN10MB (Ethernet), capture size 262144 bytes
    

    No output at all. Running curl http://10.211.55.15:80/ on the host itself, however:

    $ tcpdump -i docker0 tcp and port 80
    10:19:22.181690 IP iperror.shared.56110 > 172.17.0.2.http: Flags [S], seq 2393925363, win 43690, options [mss 65495,sackOK,TS val 3751165 ecr 0,nop,wscale 7], length 0
    10:19:22.181744 IP 172.17.0.2.http > iperror.shared.56110: Flags [S.], seq 2018401623, ack 2393925364, win 28960, options [mss 1460,sackOK,TS val 3751165 ecr 3751165,nop,wscale 7], length 0
    10:19:22.181764 IP iperror.shared.56110 > 172.17.0.2.http: Flags [.], ack 1, win 342, options [nop,nop,TS val 3751165 ecr 3751165], length 0
    10:19:22.181832 IP iperror.shared.56110 > 172.17.0.2.http: Flags [P.], seq 1:77, ack 1, win 342, options [nop,nop,TS val 3751165 ecr 3751165], length 76: HTTP: GET / HTTP/1.1
    10:19:22.181839 IP 172.17.0.2.http > iperror.shared.56110: Flags [.], ack 77, win 227, options [nop,nop,TS val 3751165 ecr 3751165], length 0
    10:19:22.182180 IP 172.17.0.2.http > iperror.shared.56110: Flags [P.], seq 1:239, ack 77, win 227, options [nop,nop,TS val 3751165 ecr 3751165], length 238: HTTP: HTTP/1.1 200 OK
    10:19:22.182192 IP iperror.shared.56110 > 172.17.0.2.http: Flags [.], ack 239, win 350, options [nop,nop,TS val 3751165 ecr 3751165], length 0
    10:19:22.182211 IP 172.17.0.2.http > iperror.shared.56110: Flags [P.], seq 239:851, ack 77, win 227, options [nop,nop,TS val 3751165 ecr 3751165], length 612: HTTP
    10:19:22.182219 IP iperror.shared.56110 > 172.17.0.2.http: Flags [.], ack 851, win 360, options [nop,nop,TS val 3751165 ecr 3751165], length 0
    10:19:22.182380 IP iperror.shared.56110 > 172.17.0.2.http: Flags [F.], seq 77, ack 851, win 360, options [nop,nop,TS val 3751165 ecr 3751165], length 0
    10:19:22.182942 IP 172.17.0.2.http > iperror.shared.56110: Flags [F.], seq 851, ack 78, win 227, options [nop,nop,TS val 3751166 ecr 3751165], length 0
    10:19:22.182957 IP iperror.shared.56110 > 172.17.0.2.http: Flags [.], ack 852, win 360, options [nop,nop,TS val 3751166 ecr 3751166], length 0
    

The same result as before; in other words, TCP packets from outside the host never reach the docker0 bridge either.

4. Check whether TCP packets reach the host

That leaves just one capture point: the host's own NIC:

  1. Capture on the host's eth0:

    Meanwhile, from outside the host, run curl http://10.211.55.15:80/

    $ tcpdump -i eth0 tcp and port 80
    tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
    listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
    10:24:18.281167 IP 10.211.55.2.61471 > iperror.shared.http: Flags [SEW], seq 4259795053, win 65535, options [mss 1460,nop,wscale 6,nop,nop,TS val 182086346 ecr 0,sackOK,eol], length 0
    10:24:19.305082 IP 10.211.55.2.61471 > iperror.shared.http: Flags [S], seq 4259795053, win 65535, options [mss 1460,nop,wscale 6,nop,nop,TS val 182087346 ecr 0,sackOK,eol], length 0
    10:24:20.339621 IP 10.211.55.2.61471 > iperror.shared.http: Flags [S], seq 4259795053, win 65535, options [mss 1460,nop,wscale 6,nop,nop,TS val 182088346 ecr 0,sackOK,eol], length 0
    10:24:21.363671 IP 10.211.55.2.61471 > iperror.shared.http: Flags [S], seq 4259795053, win 65535, options [mss 1460,nop,wscale 6,nop,nop,TS val 182089346 ecr 0,sackOK,eol], length 0
    10:24:22.396893 IP 10.211.55.2.61471 > iperror.shared.http: Flags [S], seq 4259795053, win 65535, options [mss 1460,nop,wscale 6,nop,nop,TS val 182090346 ecr 0,sackOK,eol], length 0
    10:24:23.417779 IP 10.211.55.2.61471 > iperror.shared.http: Flags [S], seq 4259795053, win 65535, options [mss 1460,nop,wscale 6,nop,nop,TS val 182091346 ecr 0,sackOK,eol], length 0
    10:24:25.471857 IP 10.211.55.2.61471 > iperror.shared.http: Flags [S], seq 4259795053, win 65535, options [mss 1460,nop,wscale 6,nop,nop,TS val 182093346 ecr 0,sackOK,eol], length 0
    

This time it is different: TCP packets from outside the host do reach the eth0 NIC, so layer-2/layer-3 connectivity is fine. Note, however, that the capture shows the same SYN being retransmitted over and over, and never a SYN-ACK in reply.

Had the packets from outside failed to reach eth0, the suspects would instead be layer-2/layer-3 reachability and security groups.

We can also capture another round and write it to a pcap file for more visual analysis in Wireshark: tcpdump -i eth0 -s 65535 -w debug.pcap 'tcp and port 80' (note that tcpdump options must come before the capture filter).

The first leg of the TCP handshake goes out and nothing comes back, which means the packets are meeting an untimely end somewhere inside the kernel.

5. Netfilter

Between the eth0 NIC and the docker0 bridge lies a long stretch of the Linux kernel stack, and Netfilter is the most likely place for a packet to be dropped.

Netfilter is the Linux kernel framework that provides the ability to modify network packets (e.g. NAT) and to filter them (e.g. firewalling).
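
For orientation, a simplified sketch of the IP-layer hook points (each hook runs the chains of whichever tables attach to it):

inbound --> PREROUTING --> (routing decision) --> FORWARD --> POSTROUTING --> outbound
                                   |                              ^
                                   v                              |
                                 INPUT --> local process --> OUTPUT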

Following this flow, we examine the hook points of the four IP-layer tables, raw -> mangle -> nat -> filter (in decreasing priority), one by one:

  1. The raw table

    Rarely used and unlikely to be the culprit, but still worth a glance.

    $ iptables -t raw -nL
    Chain PREROUTING (policy ACCEPT)
    target     prot opt source               destination
    
    Chain OUTPUT (policy ACCEPT)
    target     prot opt source               destination
    

    The first hook point from left to right, the raw table's PREROUTING chain, drops nothing.

  2. The mangle table

    Mainly used for modifying packets.

    $ iptables -t mangle -nL
    Chain PREROUTING (policy ACCEPT)
    target     prot opt source               destination
    
    Chain INPUT (policy ACCEPT)
    target     prot opt source               destination
    
    Chain FORWARD (policy ACCEPT)
    target     prot opt source               destination
    
    Chain OUTPUT (policy ACCEPT)
    target     prot opt source               destination
    
    Chain POSTROUTING (policy ACCEPT)
    target     prot opt source               destination
    

    The second hook point, the mangle table's PREROUTING chain, likewise drops nothing.

  3. The nat table

    This is where DNAT, i.e. destination address translation, is implemented.

    $ iptables -t nat -nL
    Chain PREROUTING (policy ACCEPT)
    target     prot opt source               destination
    DOCKER     all  --  0.0.0.0/0            0.0.0.0/0            ADDRTYPE match dst-type LOCAL
    
    Chain INPUT (policy ACCEPT)
    target     prot opt source               destination
    
    Chain OUTPUT (policy ACCEPT)
    target     prot opt source               destination
    DOCKER     all  --  0.0.0.0/0           !127.0.0.0/8          ADDRTYPE match dst-type LOCAL
    
    Chain POSTROUTING (policy ACCEPT)
    target     prot opt source               destination
    MASQUERADE  all  --  172.17.0.0/16        0.0.0.0/0
    MASQUERADE  tcp  --  172.17.0.2           172.17.0.2           tcp dpt:80
    
    Chain DOCKER (2 references)
    target     prot opt source               destination
    RETURN     all  --  0.0.0.0/0            0.0.0.0/0
    DNAT       tcp  --  0.0.0.0/0            0.0.0.0/0            tcp dpt:80 to:172.17.0.2:80
    

    The PREROUTING -> DOCKER chain shows clearly how Docker uses DNAT to translate host IP + port into container IP + port, and nothing looks wrong here either. But given everything observed so far, our working hypothesis is that the packets never reach the nat table's PREROUTING chain at all: once a packet passes this point, its destination IP is rewritten to 172.17.0.2 and it heads straight for the docker0 bridge. (A conntrack cross-check of this hypothesis follows the table listings below.)

  4. The filter table

    The last gate before the application is the filter table; if the packets cannot even get past the nat stage, they will never arrive here, but we list it for completeness:

    $ iptables -t filter -nL
    Chain INPUT (policy ACCEPT)
    target     prot opt source               destination
    
    Chain FORWARD (policy DROP)
    target     prot opt source               destination
    DOCKER-USER  all  --  0.0.0.0/0            0.0.0.0/0
    DOCKER-ISOLATION-STAGE-1  all  --  0.0.0.0/0            0.0.0.0/0
    ACCEPT     all  --  0.0.0.0/0            0.0.0.0/0            ctstate RELATED,ESTABLISHED
    DOCKER     all  --  0.0.0.0/0            0.0.0.0/0
    ACCEPT     all  --  0.0.0.0/0            0.0.0.0/0
    ACCEPT     all  --  0.0.0.0/0            0.0.0.0/0
    
    Chain OUTPUT (policy ACCEPT)
    target     prot opt source               destination
    
    Chain DOCKER (1 references)
    target     prot opt source               destination
    ACCEPT     tcp  --  0.0.0.0/0            172.17.0.2           tcp dpt:80
    
    Chain DOCKER-ISOLATION-STAGE-1 (1 references)
    target     prot opt source               destination
    DOCKER-ISOLATION-STAGE-2  all  --  0.0.0.0/0            0.0.0.0/0
    RETURN     all  --  0.0.0.0/0            0.0.0.0/0
    
    Chain DOCKER-ISOLATION-STAGE-2 (1 references)
    target     prot opt source               destination
    DROP       all  --  0.0.0.0/0            0.0.0.0/0
    RETURN     all  --  0.0.0.0/0            0.0.0.0/0
    
    Chain DOCKER-USER (1 references)
    target     prot opt source               destination
    RETURN     all  --  0.0.0.0/0            0.0.0.0/0
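
Before moving on, the conntrack cross-check promised above; a sketch assuming conntrack-tools is installed, with illustrative values. An entry stuck in SYN_SENT and flagged [UNREPLIED], whose reply tuple carries the container IP, would mean DNAT is being applied but no answer ever returns:

$ conntrack -L -p tcp --dport 80
tcp      6 110 SYN_SENT src=10.211.55.2 dst=10.211.55.15 sport=61471 dport=80 [UNREPLIED] src=172.17.0.2 dst=10.211.55.2 sport=80 dport=61471 mark=0 use=1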
    

Running iptables with the -v flag shows how many packets have traversed each chain and each rule:

$ iptables -t nat -nL -v
Chain PREROUTING (policy ACCEPT 330 packets, 83663 bytes)
 pkts bytes target     prot opt in     out     source               destination
  179 11376 DOCKER     all  --  *      *       0.0.0.0/0            0.0.0.0/0            ADDRTYPE match dst-type LOCAL
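
A small tip for clean before/after comparisons: iptables -Z zeroes these counters (it really resets them, so use with care on shared machines):

$ iptables -t nat -Z        # zero the pkts/bytes counters of every chain in the nat table
$ iptables -t nat -nL -v    # subsequent listings count from zero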

To pin down where the packets are actually being lost, we use a load-testing tool to fire a burst of requests at the host and watch how the counters move:

$ siege http://10.211.55.15:80/
[alert] socket: select and discovered it's not ready sock.c:384: Operation timed out
[alert] socket: read check timed out(30) sock.c:273: Operation timed out
[alert] socket: select and discovered it's not ready sock.c:384: Operation timed out
[alert] socket: read check timed out(30) sock.c:273: Operation timed out
[alert] socket: select and discovered it's not ready sock.c:384: Operation timed out
[alert] socket: select and discovered it's not ready sock.c:384: Operation timed out
[alert] socket: select and discovered it's not ready sock.c:384: Operation timed out
[alert] socket: select and discovered it's not ready sock.c:384: Operation timed out
[alert] socket: select and discovered it's not ready sock.c:384: Operation timed out
[alert] socket: read check timed out(30) sock.c:273: Operation timed out
[alert] socket: select and discovered it's not ready sock.c:384: Operation timed out
[alert] socket: read check timed out(30) sock.c:273: Operation timed out
[alert] socket: select and discovered it's not ready sock.c:384: Operation timed out
[alert] socket: read check timed out(30) sock.c:273: Operation timed out
[alert] socket: select and discovered it's not ready sock.c:384: Operation timed out
[alert] socket: read check timed out(30) sock.c:273: Operation timed out
$ iptables -t nat -nL -v
Chain PREROUTING (policy ACCEPT 334 packets, 84455 bytes)
 pkts bytes target     prot opt in     out     source               destination
  608 38832 DOCKER     all  --  *      *       0.0.0.0/0            0.0.0.0/0            ADDRTYPE match dst-type LOCAL

Look closely at the counters: the PREROUTING policy counter barely moved (330 to 334 packets), while the DOCKER rule's counter jumped from 179 to 608. The flood of SYNs is matching the DOCKER rule, so the packets do traverse the nat table's PREROUTING chain and get DNAT'ed after all; our earlier hypothesis was wrong, and the drop must happen after this chain, somewhere between the DNAT and the docker0 bridge.

6. Packet-drop analysis

dropwatch is a tool built specifically for observing packet drops in the Linux kernel; we use it to locate the real drop point (keeping the load-testing tool running against the host at the same time makes the pattern stand out):

$ dropwatch -lkas
Initalizing kallsyms db
dropwatch> start
Enabling monitoring...
Kernel monitoring activated.
Issue Ctrl-C to stop monitoring
1 drops at skb_queue_purge+18 (0xffffffff9bc3c808)
25 drops at ip_error+68 (0xffffffff9bc988f0)
8 drops at netlink_broadcast_filtered+2b9 (0xffffffff9bc8efc9)
1 drops at netlink_unicast+1fa (0xffffffff9bc90d6a)
25 drops at ip_error+68 (0xffffffff9bc988f0)
25 drops at ip_error+68 (0xffffffff9bc988f0)
25 drops at ip_error+68 (0xffffffff9bc988f0)
1 drops at __udp4_lib_rcv+bb (0xffffffff9bcd137b)
25 drops at ip_error+68 (0xffffffff9bc988f0)
25 drops at ip_error+68 (0xffffffff9bc988f0)
25 drops at ip_error+68 (0xffffffff9bc988f0)
25 drops at ip_error+68 (0xffffffff9bc988f0)
25 drops at ip_error+68 (0xffffffff9bc988f0)
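
As an aside, on hosts without dropwatch a similar signal can be pulled from the skb:kfree_skb tracepoint; a sketch, assuming perf is installed:

$ perf record -a -e skb:kfree_skb -- sleep 10    # record every freed skb, system-wide
$ perf script | head                             # the location field of each sample is the drop site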

In the dropwatch output the repeating entry is ip_error, and that alone is enough to Google. The cause of the drops: Linux IP forwarding is disabled, so the host accepts only packets destined for one of its own IP addresses and discards the rest, which now includes our packets, whose destination was just DNAT'ed to 172.17.0.2. The forwarding sysctls confirm it:

$ sysctl -a | grep "\.forwarding" | grep ipv4
sysctl: reading key "net.ipv6.conf.all.stable_secret"
sysctl: reading key "net.ipv6.conf.default.stable_secret"
sysctl: reading key "net.ipv6.conf.docker0.stable_secret"
sysctl: reading key "net.ipv6.conf.eth0.stable_secret"
sysctl: reading key "net.ipv6.conf.lo.stable_secret"
sysctl: reading key "net.ipv6.conf.vethdb35318.stable_secret"
net.ipv4.conf.all.forwarding = 0
net.ipv4.conf.default.forwarding = 0
net.ipv4.conf.docker0.forwarding = 0
net.ipv4.conf.eth0.forwarding = 0
net.ipv4.conf.lo.forwarding = 0
net.ipv4.conf.vethdb35318.forwarding = 0

According to the Docker documentation, http://docs.docker.oeynet.com/engine/userguide/networking/default_network/container-communication/, this kernel parameter is supposed to be set to 1:

$ sysctl net.ipv4.conf.all.forwarding=1
net.ipv4.conf.all.forwarding = 1
$ sysctl -a | grep "\.forwarding" | grep ipv4
sysctl: reading key "net.ipv6.conf.all.stable_secret"
sysctl: reading key "net.ipv6.conf.default.stable_secret"
sysctl: reading key "net.ipv6.conf.docker0.stable_secret"
sysctl: reading key "net.ipv6.conf.eth0.stable_secret"
sysctl: reading key "net.ipv6.conf.lo.stable_secret"
sysctl: reading key "net.ipv6.conf.vethdb35318.stable_secret"
net.ipv4.conf.all.forwarding = 1
net.ipv4.conf.default.forwarding = 1
net.ipv4.conf.docker0.forwarding = 1
net.ipv4.conf.eth0.forwarding = 1
net.ipv4.conf.lo.forwarding = 1
net.ipv4.conf.vethdb35318.forwarding = 1
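
One caveat: a value set with sysctl at runtime does not survive a reboot. A sketch of persisting it (the drop-in file name is arbitrary; assumes a distribution that reads /etc/sysctl.d):

$ echo 'net.ipv4.conf.all.forwarding = 1' > /etc/sysctl.d/99-ip-forward.conf
$ sysctl --system    # reload all sysctl configuration files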

With forwarding enabled, packets pass through Netfilter and reach the docker0 bridge without any further obstruction:

$ curl http://10.211.55.15:80/
<!DOCTYPE html>
<html>
<head>
<title>Welcome to nginx!</title>
<style>
    body {
        width: 35em;
        margin: 0 auto;
        font-family: Tahoma, Verdana, Arial, sans-serif;
    }
</style>
</head>
<body>
<h1>Welcome to nginx!</h1>
<p>If you see this page, the nginx web server is successfully installed and
working. Further configuration is required.</p>

<p>For online documentation and support please refer to
<a href="http://nginx.org/">nginx.org</a>.<br/>
Commercial support is available at
<a href="http://nginx.com/">nginx.com</a>.</p>

<p><em>Thank you for using nginx.</em></p>
</body>
</html>
