Pitfalls of Deploying an HTTPS-Based etcd Cluster

Mar 31, 2019 12:00 · Tags: HTTPS, etcd, DevOps, Debug, Kubernetes

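A quick introduction: etcd is a distributed key-value store used to store and retrieve configuration across a cluster. It is also the core component Kubernetes uses to store and retrieve the state of its objects.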

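Deploying an etcd cluster that communicates over plain HTTP is very simple: just follow the official documentation. I recommend deploying with Ansible, which saves a lot of effort compared with setting up each machine by hand: https://github.com/crazytaxii/ansible-etcds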

To check the state of the etcd cluster:

$ etcdctl cluster-health
member 3799f16725672c8e is healthy: got healthy result from http://10.211.55.25:2379
member 666302b2b897a609 is healthy: got healthy result from http://10.211.55.26:2379
member ea6898e41bf66b4c is healthy: got healthy result from http://10.211.55.24:2379
cluster is healthy
$ etcdctl member list
3799f16725672c8e: name=etcd-service2 peerURLs=http://10.211.55.25:2380 clientURLs=http://10.211.55.25:2379 isLeader=true
666302b2b897a609: name=etcd-service3 peerURLs=http://10.211.55.26:2380 clientURLs=http://10.211.55.26:2379 isLeader=false
ea6898e41bf66b4c: name=etcd-service1 peerURLs=http://10.211.55.24:2380 clientURLs=http://10.211.55.24:2379 isLeader=false
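
For use in a script, the health check can be reduced to an exit status. The following is only a minimal sketch and not part of the original setup: the function name `cluster_is_healthy` is hypothetical, and it assumes the v2 `etcdctl cluster-health` output format shown above, where each member line reads `member <id> is healthy: ...` and the summary line reads `cluster is healthy`.

```shell
# Hypothetical helper: reads `etcdctl cluster-health` (v2 API) output on stdin.
# Succeeds only if the summary line "cluster is healthy" is present and
# no member line reports anything other than "is healthy".
cluster_is_healthy() {
    awk '
        /^member / && $0 !~ /^member [0-9a-f]+ is healthy:/ { bad = 1 }
        $0 == "cluster is healthy"                         { ok = 1 }
        END { exit (ok && !bad) ? 0 : 1 }
    '
}

# Example usage (requires a reachable cluster):
#   etcdctl cluster-health | cluster_is_healthy && echo "all members healthy"
```

Because the function only parses text, it can be tried against saved output before being pointed at a live cluster.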

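For security, though, it is sometimes necessary to use HTTPS. etcd supports automatic TLS as well as certificate-based authentication between peers. Most Kubernetes clusters are deployed on an internal network that uses private IP addresses, and a public CA will not issue a certificate for a private IP address; it only signs certificates for names it can validate publicly, such as domain names. So we need to set up our own CA to sign the certificates, again following the official documentation step by step.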
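
To make the "own CA plus IP certificate" idea concrete, here is a rough sketch using plain `openssl`. It is a substitute for illustration, not the procedure from the official documentation that this post follows. The function name `make_etcd_certs`, the subject names `etcd-ca` and `etcd`, the RSA key size and the 365-day validity are all assumptions; only the output file names `ca.pem`, `cert.pem` and `key.pem` are chosen to match the files passed to etcd in this post.

```shell
# Hypothetical helper: create a throwaway CA and one certificate whose
# subjectAltName entries are IP addresses.
# Usage: make_etcd_certs <output-dir> <ip> [<ip> ...]
make_etcd_certs() {
    out_dir=$1
    shift
    mkdir -p "$out_dir" || return 1

    # Build "IP:a,IP:b,..." for the subjectAltName extension.
    san=""
    for ip in "$@"; do
        san="${san:+$san,}IP:$ip"
    done
    printf 'subjectAltName=%s\n' "$san" > "$out_dir/san.ext"

    # Self-signed CA: private key plus certificate.
    openssl req -x509 -newkey rsa:2048 -nodes -days 365 \
        -subj "/CN=etcd-ca" \
        -keyout "$out_dir/ca-key.pem" -out "$out_dir/ca.pem" 2>/dev/null || return 1

    # Private key and signing request for the etcd members.
    openssl req -newkey rsa:2048 -nodes \
        -subj "/CN=etcd" \
        -keyout "$out_dir/key.pem" -out "$out_dir/cert.csr" 2>/dev/null || return 1

    # Sign the request with the CA and attach the IP SANs.
    openssl x509 -req -days 365 \
        -in "$out_dir/cert.csr" \
        -CA "$out_dir/ca.pem" -CAkey "$out_dir/ca-key.pem" -CAcreateserial \
        -extfile "$out_dir/san.ext" \
        -out "$out_dir/cert.pem" 2>/dev/null || return 1
}

# Example usage:
#   make_etcd_certs ./certs 10.211.55.24 10.211.55.25 10.211.55.26 127.0.0.1
#   openssl x509 -in ./certs/cert.pem -noout -text | grep "IP Address"
```

A single certificate covering every node's IP keeps the sketch short; issuing one certificate per node is the stricter choice if the private keys should not be shared between machines.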

Of course, I still recommend Ansible for this, which gets it done in one go: https://github.com/crazytaxii/ansible-etcds/tree/https (if you like it, give it a star (•̀ω•́)✧).

Suppose we have already deployed an HTTP-based etcd cluster on three nodes and now want to switch it to HTTPS. First, stop the running etcd on each node:

$ systemctl stop etcd.service
$ systemctl disable etcd.service

Then generate the certificate files (by hand, or by running the shell script) and run the Ansible playbook. But the last step, starting etcd through systemd, timed out:

 __________________________________
< TASK [etcds : Start etcd daemon] >
 ----------------------------------
        \   ^__^
         \  (oo)\_______
            (__)\       )\/\
                ||----w |
                ||     ||

fatal: [10.211.55.25]: FAILED! => {"changed": false, "msg": "Unable to start service etcd.service: Job for etcd.service failed because a timeout was exceeded. See \"systemctl status etcd.service\" and \"journalctl -xe\" for details.\n"}
fatal: [10.211.55.26]: FAILED! => {"changed": false, "msg": "Unable to start service etcd.service: Job for etcd.service failed because a timeout was exceeded. See \"systemctl status etcd.service\" and \"journalctl -xe\" for details.\n"}
fatal: [10.211.55.24]: FAILED! => {"changed": false, "msg": "Unable to start service etcd.service: Job for etcd.service failed because a timeout was exceeded. See \"systemctl status etcd.service\" and \"journalctl -xe\" for details.\n"}

Pick one of the machines and see what is going on:

$ systemctl status etcd
● etcd.service - etcd-service
   Loaded: loaded (/etc/systemd/system/etcd.service; enabled; vendor preset: disabled)
   Active: activating (start) since Sun 2019-03-31 11:16:26 CST; 4s ago
     Docs: https://github.com/coreos/etcd
 Main PID: 16556 (etcd)
    Tasks: 8
   CGroup: /system.slice/etcd.service
           └─16556 /usr/local/bin/etcd --name etcd-service1 --data-dir /var/lib/etcd --listen-client-urls https://10.211.55.24:2379,https://127.0.0.1:2379 --listen-peer-urls https://10.211.55.24:2380 --initial-advertise-peer-urls https://10.211.55.24:2380 --advertise-client-ur...

Mar 31 11:16:30 centos-node11.shared etcd[16556]: rejected connection from "10.211.55.26:58712" (error "tls: first record does not look like a TLS handshake", ServerName "")
Mar 31 11:16:30 centos-node11.shared etcd[16556]: rejected connection from "10.211.55.26:58710" (error "tls: first record does not look like a TLS handshake", ServerName "")
Mar 31 11:16:30 centos-node11.shared etcd[16556]: rejected connection from "10.211.55.25:43042" (error "tls: first record does not look like a TLS handshake", ServerName "")
Mar 31 11:16:30 centos-node11.shared etcd[16556]: rejected connection from "10.211.55.25:43044" (error "tls: first record does not look like a TLS handshake", ServerName "")
Mar 31 11:16:30 centos-node11.shared etcd[16556]: rejected connection from "10.211.55.26:58719" (error "tls: first record does not look like a TLS handshake", ServerName "")
Mar 31 11:16:30 centos-node11.shared etcd[16556]: rejected connection from "10.211.55.26:58718" (error "tls: first record does not look like a TLS handshake", ServerName "")
Mar 31 11:16:30 centos-node11.shared etcd[16556]: rejected connection from "10.211.55.26:58726" (error "tls: first record does not look like a TLS handshake", ServerName "")
Mar 31 11:16:30 centos-node11.shared etcd[16556]: rejected connection from "10.211.55.26:58728" (error "tls: first record does not look like a TLS handshake", ServerName "")
Mar 31 11:16:30 centos-node11.shared etcd[16556]: rejected connection from "10.211.55.25:43046" (error "tls: first record does not look like a TLS handshake", ServerName "")
Mar 31 11:16:30 centos-node11.shared etcd[16556]: rejected connection from "10.211.55.25:43052" (error "tls: first record does not look like a TLS handshake", ServerName "")
Mar 31 11:16:30 centos-node11.shared etcd[16556]: rejected connection from "10.211.55.26:58734" (error "tls: first record does not look like a TLS handshake", ServerName "")

Connections are being rejected, and the error message makes it clear that TLS is involved.
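
The error "tls: first record does not look like a TLS handshake" means the first bytes received on the connection were not a TLS record, which is what a TLS listener sees when the other end speaks plain text. To see who is sending that traffic, the rejects can be tallied per source address. This is just a sketch: the function name `count_tls_rejects` is hypothetical and the pattern assumes the exact log wording shown above.

```shell
# Hypothetical helper: reads etcd log lines on stdin and prints
# "<source-ip> <count>" for every peer whose connections were rejected
# with the TLS handshake error shown above.
count_tls_rejects() {
    sed -n 's/.*rejected connection from "\([0-9.]*\):[0-9]*" (error "tls: first record does not look like a TLS handshake".*/\1/p' |
        sort | uniq -c | awk '{ print $2, $1 }'
}

# Example usage:
#   journalctl -u etcd.service | count_tls_rejects
```

Applied to the eleven journal lines above, this would report 4 rejects from 10.211.55.25 and 7 from 10.211.55.26, so both of the other members are affected rather than a single misconfigured node.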

$ ps aux | grep "etcd"
root     18173  4.3  0.6 10534444 22680 ?      Ssl  11:24   0:02 /usr/local/bin/etcd --name etcd-service1 --data-dir /var/lib/etcd --listen-client-urls https://10.211.55.24:2379,https://127.0.0.1:2379 --listen-peer-urls https://10.211.55.24:2380 --initial-advertise-peer-urls https://10.211.55.24:2380 --advertise-client-urls https://10.211.55.24:2379 --initial-cluster-token etcd-cluster-1 --initial-cluster etcd-service1=https://10.211.55.24:2380,etcd-service2=https://10.211.55.25:2380,etcd-service3=https://10.211.55.26:2380 --initial-cluster-state new --heartbeat-interval 1000 --election-timeout 5000 --trusted-ca-file /etc/ssl/etcd/ca.pem --key-file /etc/ssl/etcd/key.pem --cert-file /etc/ssl/etcd/cert.pem --peer-client-cert-auth=true --peer-trusted-ca-file /etc/ssl/etcd/ca.pem --peer-key-file /etc/ssl/etcd/key.pem --peer-cert-file /etc/ssl/etcd/cert.pem

etcd is in fact still running, and the command line used to start it looks fine, so I simply killed the process and ran etcd by hand.

$ /usr/local/bin/etcd \
>   --name etcd-service1 \
>   --data-dir /var/lib/etcd \
>   --listen-client-urls https://10.211.55.24:2379,https://127.0.0.1:2379 \
>   --listen-peer-urls https://10.211.55.24:2380 \
>   --initial-advertise-peer-urls https://10.211.55.24:2380 \
>   --advertise-client-urls https://10.211.55.24:2379 \
>   --initial-cluster-token etcd-cluster-1 \
>   --initial-cluster etcd-service1=https://10.211.55.24:2380,etcd-service2=https://10.211.55.25:2380,etcd-service3=https://10.211.55.26:2380 \
>   --initial-cluster-state new \
>   --heartbeat-interval 1000 \
>   --election-timeout 5000 \
>   --trusted-ca-file /etc/ssl/etcd/ca.pem \
>   --key-file /etc/ssl/etcd/key.pem \
>   --cert-file /etc/ssl/etcd/cert.pem \
>   --peer-client-cert-auth=true \
>   --peer-trusted-ca-file /etc/ssl/etcd/ca.pem \
>   --peer-key-file /etc/ssl/etcd/key.pem \
>   --peer-cert-file /etc/ssl/etcd/cert.pem
2019-03-31 11:30:20.803355 I | etcdmain: etcd Version: 3.3.12
2019-03-31 11:30:20.803784 I | etcdmain: Git SHA: d57e8b8
2019-03-31 11:30:20.803790 I | etcdmain: Go Version: go1.10.8
2019-03-31 11:30:20.803793 I | etcdmain: Go OS/Arch: linux/amd64
2019-03-31 11:30:20.803796 I | etcdmain: setting maximum number of CPUs to 2, total number of available CPUs is 2
2019-03-31 11:30:20.803841 N | etcdmain: the server is already initialized as member before, starting as etcd member...
2019-03-31 11:30:20.803880 I | embed: peerTLS: cert = /etc/ssl/etcd/cert.pem, key = /etc/ssl/etcd/key.pem, ca = , trusted-ca = /etc/ssl/etcd/ca.pem, client-cert-auth = true, crl-file =
2019-03-31 11:30:20.804655 I | embed: listening for peers on https://10.211.55.24:2380
2019-03-31 11:30:20.804704 I | embed: listening for client requests on 10.211.55.24:2379
2019-03-31 11:30:20.804730 I | embed: listening for client requests on 127.0.0.1:2379
2019-03-31 11:30:20.805585 I | etcdserver: name = etcd-service1
2019-03-31 11:30:20.805592 I | etcdserver: data dir = /var/lib/etcd
2019-03-31 11:30:20.805595 I | etcdserver: member dir = /var/lib/etcd/member
2019-03-31 11:30:20.805598 I | etcdserver: heartbeat = 1000ms
2019-03-31 11:30:20.805601 I | etcdserver: election = 5000ms
2019-03-31 11:30:20.805604 I | etcdserver: snapshot count = 100000
2019-03-31 11:30:20.805615 I | etcdserver: advertise client URLs = https://10.211.55.24:2379
2019-03-31 11:30:20.806188 I | etcdserver: restarting member ea6898e41bf66b4c in cluster 66898ae5243d3dda at commit index 11
2019-03-31 11:30:20.806222 I | raft: ea6898e41bf66b4c became follower at term 200
2019-03-31 11:30:20.806233 I | raft: newRaft ea6898e41bf66b4c [peers: [], term: 200, commit: 11, applied: 0, lastindex: 11, lastterm: 2]
2019-03-31 11:30:20.807733 W | auth: simple token is not cryptographically signed
2019-03-31 11:30:20.809291 I | etcdserver: starting server... [version: 3.3.12, cluster version: to_be_decided]
2019-03-31 11:30:20.810370 I | etcdserver/membership: added member 3799f16725672c8e [http://10.211.55.25:2380] to cluster 66898ae5243d3dda
2019-03-31 11:30:20.810390 I | rafthttp: starting peer 3799f16725672c8e...
2019-03-31 11:30:20.810421 I | rafthttp: started HTTP pipelining with peer 3799f16725672c8e
2019-03-31 11:30:20.810956 I | rafthttp: started streaming with peer 3799f16725672c8e (writer)
2019-03-31 11:30:20.813567 I | embed: ClientTLS: cert = /etc/ssl/etcd/cert.pem, key = /etc/ssl/etcd/key.pem, ca = , trusted-ca = /etc/ssl/etcd/ca.pem, client-cert-auth = false, crl-file =
2019-03-31 11:30:20.813981 I | rafthttp: started peer 3799f16725672c8e
2019-03-31 11:30:20.814013 I | rafthttp: added peer 3799f16725672c8e
2019-03-31 11:30:20.814368 I | etcdserver/membership: added member 666302b2b897a609 [http://10.211.55.26:2380] to cluster 66898ae5243d3dda
2019-03-31 11:30:20.814403 I | rafthttp: starting peer 666302b2b897a609...
2019-03-31 11:30:20.814426 I | rafthttp: started HTTP pipelining with peer 666302b2b897a609
2019-03-31 11:30:20.814601 I | rafthttp: started streaming with peer 3799f16725672c8e (stream Message reader)
2019-03-31 11:30:20.815089 I | rafthttp: started streaming with peer 666302b2b897a609 (writer)
2019-03-31 11:30:20.815777 I | rafthttp: started streaming with peer 3799f16725672c8e (writer)
2019-03-31 11:30:20.815936 I | rafthttp: started streaming with peer 3799f16725672c8e (stream MsgApp v2 reader)
2019-03-31 11:30:20.816499 I | rafthttp: started peer 666302b2b897a609
2019-03-31 11:30:20.816534 I | rafthttp: started streaming with peer 666302b2b897a609 (writer)
2019-03-31 11:30:20.816552 I | rafthttp: added peer 666302b2b897a609
2019-03-31 11:30:20.817037 I | etcdserver/membership: added member ea6898e41bf66b4c [http://10.211.55.24:2380] to cluster 66898ae5243d3dda
2019-03-31 11:30:20.817083 I | rafthttp: started streaming with peer 666302b2b897a609 (stream MsgApp v2 reader)
2019-03-31 11:30:20.817269 N | etcdserver/membership: set the initial cluster version to 3.3
2019-03-31 11:30:20.817321 I | rafthttp: started streaming with peer 666302b2b897a609 (stream Message reader)
2019-03-31 11:30:20.817340 I | etcdserver/api: enabled capabilities for version 3.3
2019-03-31 11:30:20.828475 I | embed: rejected connection from "10.211.55.26:39091" (error "tls: first record does not look like a TLS handshake", ServerName "")
2019-03-31 11:30:20.828550 I | embed: rejected connection from "10.211.55.26:39090" (error "tls: first record does not look like a TLS handshake", ServerName "")
2019-03-31 11:30:20.848318 I | embed: rejected connection from "10.211.55.25:51382" (error "tls: first record does not look like a TLS handshake", ServerName "")
2019-03-31 11:30:20.848383 I | embed: rejected connection from "10.211.55.25:51384" (error "tls: first record does not look like a TLS handshake", ServerName "")

The second half of the log looks much like the earlier output of systemctl status etcd.service. Digging into the lines before it:

rafthttp: started HTTP pipelining with peer 3799f16725672c8e
etcdserver/membership: added member 666302b2b897a609 [http://10.211.55.26:2380] to cluster 66898ae5243d3dda
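
Lines like these are easy to miss in a long startup log. As a sketch, the membership lines that carry a plain http:// peer URL can be filtered out mechanically. The function name `find_plain_http_peers` is hypothetical, and the pattern assumes the `added member <id> [<url>] to cluster <id>` wording seen in this etcd 3.3.12 log.

```shell
# Hypothetical helper: reads etcd log lines on stdin and prints
# "<member-id> <peer-url>" for each member registered with an http:// peer URL.
find_plain_http_peers() {
    sed -n 's|.*added member \([0-9a-f]*\) \[\(http://[^]]*\)] to cluster.*|\1 \2|p'
}

# Example usage, assuming the startup log was saved to etcd.log:
#   find_plain_http_peers < etcd.log
```

Any output at all from a node that was started with https:// peer URLs is a sign that the membership information is coming from somewhere other than the command line.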

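HTTPS was clearly enabled, yet the members were still registered with http:// peer URLs and communication fell back to HTTP... I Googled this for a long time when I ran into it, but similar reports were very rare and nothing useful turned up. What finally helped me solve it was the thread at https://github.com/etcd-io/etcd/issues/10128: I tried deleting etcd's database and starting etcd again, and this time HTTPS came up successfully.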

$ export ETCDCTL_API=3
$ etcdctl --cacert=/etc/ssl/etcd/ca.pem --cert=/etc/ssl/etcd/cert.pem --key=/etc/ssl/etcd/key.pem --endpoints=https://10.211.55.24:2379,https://10.211.55.25:2379,https://10.211.55.26:2379 endpoint health
https://10.211.55.24:2379 is healthy: successfully committed proposal: took = 1.958906ms
https://10.211.55.26:2379 is healthy: successfully committed proposal: took = 2.75776ms
https://10.211.55.25:2379 is healthy: successfully committed proposal: took = 1.809552ms

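While the cluster was running over HTTP it left some "dirty data" in the database, and etcd evidently read that data at startup: the log reports "the server is already initialized as member before" and re-adds every member with its old http:// peer URL, regardless of the https:// URLs given on the command line. Normally an etcd cluster is deployed straight onto fresh machines with no leftover data, which is why this problem is rare. But once you do hit it, it is painful and wastes a lot of time to track down.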
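
Since the stale state lives in the data directory (the log shows `member dir = /var/lib/etcd/member`), a deployment script could check for it before starting etcd with new URLs. The sketch below is a more cautious variant of what I did: instead of deleting the data it moves the directory aside. The function name `set_aside_etcd_data` and the `.bak.<timestamp>` naming are assumptions. Either way the cluster is bootstrapped from scratch, so keys stored in the old cluster are not carried over.

```shell
# Hypothetical helper: if <data-dir>/member exists (state left by an earlier
# deployment), move the whole data directory aside and recreate it empty.
# Prints the backup path when something was moved.
set_aside_etcd_data() {
    data_dir=$1
    [ -d "$data_dir/member" ] || return 0      # nothing left over, nothing to do
    backup="${data_dir}.bak.$(date +%Y%m%d%H%M%S)"
    [ ! -e "$backup" ] || return 1             # refuse to overwrite an existing backup
    mv "$data_dir" "$backup" || return 1
    mkdir -p "$data_dir" && chmod 700 "$data_dir" || return 1
    printf '%s\n' "$backup"
}

# Example usage on each node, with etcd stopped:
#   set_aside_etcd_data /var/lib/etcd
```

Keeping the old directory around costs only disk space and leaves a way back if the HTTPS rollout has to be abandoned.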