Pitfalls of Deploying an HTTPS-Based etcd Cluster
Mar 31, 2019 12:00 · 2018 words · 5 minute read
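A quick introduction: etcd is a distributed key-value store used to store and retrieve configuration across a cluster. It is also the core component Kubernetes uses to store and retrieve the state of its objects.
Deploying an etcd cluster that communicates over plain HTTP is straightforward: just follow the official documentation. I recommend deploying with Ansible, which is far less work than setting up each machine by hand: https://github.com/crazytaxii/ansible-etcds
To check the state of the etcd cluster: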
$ etcdctl cluster-health
member 3799f16725672c8e is healthy: got healthy result from http://10.211.55.25:2379
member 666302b2b897a609 is healthy: got healthy result from http://10.211.55.26:2379
member ea6898e41bf66b4c is healthy: got healthy result from http://10.211.55.24:2379
cluster is healthy
$ etcdctl member list
3799f16725672c8e: name=etcd-service2 peerURLs=http://10.211.55.25:2380 clientURLs=http://10.211.55.25:2379 isLeader=true
666302b2b897a609: name=etcd-service3 peerURLs=http://10.211.55.26:2380 clientURLs=http://10.211.55.26:2379 isLeader=false
ea6898e41bf66b4c: name=etcd-service1 peerURLs=http://10.211.55.24:2380 clientURLs=http://10.211.55.24:2379 isLeader=false
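For security reasons, though, you sometimes need HTTPS. etcd supports automatic TLS as well as certificate-based peer-to-peer authentication. Most Kubernetes clusters are deployed on internal networks with private IPs, and a public CA will only sign certificates for domain names, not for private IP addresses, so we have to set up our own CA to issue the certificates, again following the official documentation step by step.
Once more, I recommend deploying with Ansible: https://github.com/crazytaxii/ansible-etcds/tree/https gets it done in one go (if you like it, give it a star (•̀ω•́)✧).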
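For reference, the self-signed CA step can be sketched with plain openssl. This is only an illustration under assumptions: the official guide and the playbook may use a different tool, the subject names (etcd-ca, etcd) and the scratch directory are placeholders, and a single certificate shared by all three members is assumed because the flags shown later point server and peer TLS at the same cert.pem. The one detail that matters is that the node IPs go into the certificate as IP subject alternative names.

```shell
set -e
# Work in a scratch directory; copy ca.pem, cert.pem and key.pem
# to /etc/ssl/etcd on each node afterwards.
workdir=$(mktemp -d)
cd "$workdir"

# 1. Private CA key and self-signed CA certificate (placeholder subject).
openssl genrsa -out ca-key.pem 2048
openssl req -x509 -new -key ca-key.pem -days 3650 \
  -subj "/CN=etcd-ca" -out ca.pem

# 2. Key and signing request for the etcd members (placeholder subject).
openssl genrsa -out key.pem 2048
openssl req -new -key key.pem -subj "/CN=etcd" -out etcd.csr

# 3. Extensions: every node IP as a SAN, usable as both server and
#    client certificate because peers authenticate each other.
cat > san.ext <<'EOF'
subjectAltName = IP:10.211.55.24, IP:10.211.55.25, IP:10.211.55.26, IP:127.0.0.1
extendedKeyUsage = serverAuth, clientAuth
EOF

# 4. Sign the request with the private CA.
openssl x509 -req -in etcd.csr -CA ca.pem -CAkey ca-key.pem -CAcreateserial \
  -days 3650 -extfile san.ext -out cert.pem

# 5. Sanity check: the certificate must chain to the CA.
openssl verify -CAfile ca.pem cert.pem
```

Sharing one key across all members keeps the sketch short; issuing a separate certificate per node is the stricter choice.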
Suppose we have already deployed an HTTP-based etcd cluster on three nodes and now want to turn on HTTPS. First, stop the running etcd:
$ systemctl stop etcd.service
$ systemctl disable etcd.service
Then generate the certificate files (by hand or by running the shell script) and run the Ansible playbook. The final step, starting etcd via systemd, timed out:
 __________________________________
< TASK [etcds : Start etcd daemon] >
 ----------------------------------
        \   ^__^
         \  (oo)\_______
            (__)\       )\/\
                ||----w |
                ||     ||
fatal: [10.211.55.25]: FAILED! => {"changed": false, "msg": "Unable to start service etcd.service: Job for etcd.service failed because a timeout was exceeded. See \"systemctl status etcd.service\" and \"journalctl -xe\" for details.\n"}
fatal: [10.211.55.26]: FAILED! => {"changed": false, "msg": "Unable to start service etcd.service: Job for etcd.service failed because a timeout was exceeded. See \"systemctl status etcd.service\" and \"journalctl -xe\" for details.\n"}
fatal: [10.211.55.24]: FAILED! => {"changed": false, "msg": "Unable to start service etcd.service: Job for etcd.service failed because a timeout was exceeded. See \"systemctl status etcd.service\" and \"journalctl -xe\" for details.\n"}
Pick one of the machines and see what is going on:
$ systemctl status etcd
● etcd.service - etcd-service
Loaded: loaded (/etc/systemd/system/etcd.service; enabled; vendor preset: disabled)
Active: activating (start) since Sun 2019-03-31 11:16:26 CST; 4s ago
Docs: https://github.com/coreos/etcd
Main PID: 16556 (etcd)
Tasks: 8
CGroup: /system.slice/etcd.service
└─16556 /usr/local/bin/etcd --name etcd-service1 --data-dir /var/lib/etcd --listen-client-urls https://10.211.55.24:2379,https://127.0.0.1:2379 --listen-peer-urls https://10.211.55.24:2380 --initial-advertise-peer-urls https://10.211.55.24:2380 --advertise-client-ur...
Mar 31 11:16:30 centos-node11.shared etcd[16556]: rejected connection from "10.211.55.26:58712" (error "tls: first record does not look like a TLS handshake", ServerName "")
Mar 31 11:16:30 centos-node11.shared etcd[16556]: rejected connection from "10.211.55.26:58710" (error "tls: first record does not look like a TLS handshake", ServerName "")
Mar 31 11:16:30 centos-node11.shared etcd[16556]: rejected connection from "10.211.55.25:43042" (error "tls: first record does not look like a TLS handshake", ServerName "")
Mar 31 11:16:30 centos-node11.shared etcd[16556]: rejected connection from "10.211.55.25:43044" (error "tls: first record does not look like a TLS handshake", ServerName "")
Mar 31 11:16:30 centos-node11.shared etcd[16556]: rejected connection from "10.211.55.26:58719" (error "tls: first record does not look like a TLS handshake", ServerName "")
Mar 31 11:16:30 centos-node11.shared etcd[16556]: rejected connection from "10.211.55.26:58718" (error "tls: first record does not look like a TLS handshake", ServerName "")
Mar 31 11:16:30 centos-node11.shared etcd[16556]: rejected connection from "10.211.55.26:58726" (error "tls: first record does not look like a TLS handshake", ServerName "")
Mar 31 11:16:30 centos-node11.shared etcd[16556]: rejected connection from "10.211.55.26:58728" (error "tls: first record does not look like a TLS handshake", ServerName "")
Mar 31 11:16:30 centos-node11.shared etcd[16556]: rejected connection from "10.211.55.25:43046" (error "tls: first record does not look like a TLS handshake", ServerName "")
Mar 31 11:16:30 centos-node11.shared etcd[16556]: rejected connection from "10.211.55.25:43052" (error "tls: first record does not look like a TLS handshake", ServerName "")
Mar 31 11:16:30 centos-node11.shared etcd[16556]: rejected connection from "10.211.55.26:58734" (error "tls: first record does not look like a TLS handshake", ServerName "")
The connections are being rejected, and intuition says this must have something to do with TLS.
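To see at a glance which peers are sending the rejected traffic, the journal lines can be tallied per source IP. The following is a minimal sketch that assumes the log format shown above; the function name count_rejects_by_peer is made up for illustration.

```shell
# Read etcd log lines on stdin and print "<peer-ip> <count>" for every
# source address whose connection was rejected, sorted by IP.
count_rejects_by_peer() {
  awk -F'"' '
    /rejected connection from/ {
      # The second quote-delimited field is "ip:port"; keep the IP only.
      split($2, addr, ":")
      seen[addr[1]]++
    }
    END { for (ip in seen) print ip, seen[ip] }
  ' | sort
}

# Typical use on a node (not run here):
#   journalctl -u etcd.service --no-pager | count_rejects_by_peer
```

If every other member shows up in the tally, the problem is probably not one misconfigured node but something all members have in common.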
$ ps aux | grep "etcd"
root 18173 4.3 0.6 10534444 22680 ? Ssl 11:24 0:02 /usr/local/bin/etcd --name etcd-service1 --data-dir /var/lib/etcd --listen-client-urls https://10.211.55.24:2379,https://127.0.0.1:2379 --listen-peer-urls https://10.211.55.24:2380 --initial-advertise-peer-urls https://10.211.55.24:2380 --advertise-client-urls https://10.211.55.24:2379 --initial-cluster-token etcd-cluster-1 --initial-cluster etcd-service1=https://10.211.55.24:2380,etcd-service2=https://10.211.55.25:2380,etcd-service3=https://10.211.55.26:2380 --initial-cluster-state new --heartbeat-interval 1000 --election-timeout 5000 --trusted-ca-file /etc/ssl/etcd/ca.pem --key-file /etc/ssl/etcd/key.pem --cert-file /etc/ssl/etcd/cert.pem --peer-client-cert-auth=true --peer-trusted-ca-file /etc/ssl/etcd/ca.pem --peer-key-file /etc/ssl/etcd/key.pem --peer-cert-file /etc/ssl/etcd/cert.pem
etcd is in fact still running, and the command that starts it looks fine, so I simply killed the process and ran etcd by hand.
$ /usr/local/bin/etcd \
> --name etcd-service1 \
> --data-dir /var/lib/etcd \
> --listen-client-urls https://10.211.55.24:2379,https://127.0.0.1:2379 \
> --listen-peer-urls https://10.211.55.24:2380 \
> --initial-advertise-peer-urls https://10.211.55.24:2380 \
> --advertise-client-urls https://10.211.55.24:2379 \
> --initial-cluster-token etcd-cluster-1 \
> --initial-cluster etcd-service1=https://10.211.55.24:2380,etcd-service2=https://10.211.55.25:2380,etcd-service3=https://10.211.55.26:2380 \
> --initial-cluster-state new \
> --heartbeat-interval 1000 \
> --election-timeout 5000 \
> --trusted-ca-file /etc/ssl/etcd/ca.pem \
> --key-file /etc/ssl/etcd/key.pem \
> --cert-file /etc/ssl/etcd/cert.pem \
> --peer-client-cert-auth=true \
> --peer-trusted-ca-file /etc/ssl/etcd/ca.pem \
> --peer-key-file /etc/ssl/etcd/key.pem \
> --peer-cert-file /etc/ssl/etcd/cert.pem
2019-03-31 11:30:20.803355 I | etcdmain: etcd Version: 3.3.12
2019-03-31 11:30:20.803784 I | etcdmain: Git SHA: d57e8b8
2019-03-31 11:30:20.803790 I | etcdmain: Go Version: go1.10.8
2019-03-31 11:30:20.803793 I | etcdmain: Go OS/Arch: linux/amd64
2019-03-31 11:30:20.803796 I | etcdmain: setting maximum number of CPUs to 2, total number of available CPUs is 2
2019-03-31 11:30:20.803841 N | etcdmain: the server is already initialized as member before, starting as etcd member...
2019-03-31 11:30:20.803880 I | embed: peerTLS: cert = /etc/ssl/etcd/cert.pem, key = /etc/ssl/etcd/key.pem, ca = , trusted-ca = /etc/ssl/etcd/ca.pem, client-cert-auth = true, crl-file =
2019-03-31 11:30:20.804655 I | embed: listening for peers on https://10.211.55.24:2380
2019-03-31 11:30:20.804704 I | embed: listening for client requests on 10.211.55.24:2379
2019-03-31 11:30:20.804730 I | embed: listening for client requests on 127.0.0.1:2379
2019-03-31 11:30:20.805585 I | etcdserver: name = etcd-service1
2019-03-31 11:30:20.805592 I | etcdserver: data dir = /var/lib/etcd
2019-03-31 11:30:20.805595 I | etcdserver: member dir = /var/lib/etcd/member
2019-03-31 11:30:20.805598 I | etcdserver: heartbeat = 1000ms
2019-03-31 11:30:20.805601 I | etcdserver: election = 5000ms
2019-03-31 11:30:20.805604 I | etcdserver: snapshot count = 100000
2019-03-31 11:30:20.805615 I | etcdserver: advertise client URLs = https://10.211.55.24:2379
2019-03-31 11:30:20.806188 I | etcdserver: restarting member ea6898e41bf66b4c in cluster 66898ae5243d3dda at commit index 11
2019-03-31 11:30:20.806222 I | raft: ea6898e41bf66b4c became follower at term 200
2019-03-31 11:30:20.806233 I | raft: newRaft ea6898e41bf66b4c [peers: [], term: 200, commit: 11, applied: 0, lastindex: 11, lastterm: 2]
2019-03-31 11:30:20.807733 W | auth: simple token is not cryptographically signed
2019-03-31 11:30:20.809291 I | etcdserver: starting server... [version: 3.3.12, cluster version: to_be_decided]
2019-03-31 11:30:20.810370 I | etcdserver/membership: added member 3799f16725672c8e [http://10.211.55.25:2380] to cluster 66898ae5243d3dda
2019-03-31 11:30:20.810390 I | rafthttp: starting peer 3799f16725672c8e...
2019-03-31 11:30:20.810421 I | rafthttp: started HTTP pipelining with peer 3799f16725672c8e
2019-03-31 11:30:20.810956 I | rafthttp: started streaming with peer 3799f16725672c8e (writer)
2019-03-31 11:30:20.813567 I | embed: ClientTLS: cert = /etc/ssl/etcd/cert.pem, key = /etc/ssl/etcd/key.pem, ca = , trusted-ca = /etc/ssl/etcd/ca.pem, client-cert-auth = false, crl-file =
2019-03-31 11:30:20.813981 I | rafthttp: started peer 3799f16725672c8e
2019-03-31 11:30:20.814013 I | rafthttp: added peer 3799f16725672c8e
2019-03-31 11:30:20.814368 I | etcdserver/membership: added member 666302b2b897a609 [http://10.211.55.26:2380] to cluster 66898ae5243d3dda
2019-03-31 11:30:20.814403 I | rafthttp: starting peer 666302b2b897a609...
2019-03-31 11:30:20.814426 I | rafthttp: started HTTP pipelining with peer 666302b2b897a609
2019-03-31 11:30:20.814601 I | rafthttp: started streaming with peer 3799f16725672c8e (stream Message reader)
2019-03-31 11:30:20.815089 I | rafthttp: started streaming with peer 666302b2b897a609 (writer)
2019-03-31 11:30:20.815777 I | rafthttp: started streaming with peer 3799f16725672c8e (writer)
2019-03-31 11:30:20.815936 I | rafthttp: started streaming with peer 3799f16725672c8e (stream MsgApp v2 reader)
2019-03-31 11:30:20.816499 I | rafthttp: started peer 666302b2b897a609
2019-03-31 11:30:20.816534 I | rafthttp: started streaming with peer 666302b2b897a609 (writer)
2019-03-31 11:30:20.816552 I | rafthttp: added peer 666302b2b897a609
2019-03-31 11:30:20.817037 I | etcdserver/membership: added member ea6898e41bf66b4c [http://10.211.55.24:2380] to cluster 66898ae5243d3dda
2019-03-31 11:30:20.817083 I | rafthttp: started streaming with peer 666302b2b897a609 (stream MsgApp v2 reader)
2019-03-31 11:30:20.817269 N | etcdserver/membership: set the initial cluster version to 3.3
2019-03-31 11:30:20.817321 I | rafthttp: started streaming with peer 666302b2b897a609 (stream Message reader)
2019-03-31 11:30:20.817340 I | etcdserver/api: enabled capabilities for version 3.3
2019-03-31 11:30:20.828475 I | embed: rejected connection from "10.211.55.26:39091" (error "tls: first record does not look like a TLS handshake", ServerName "")
2019-03-31 11:30:20.828550 I | embed: rejected connection from "10.211.55.26:39090" (error "tls: first record does not look like a TLS handshake", ServerName "")
2019-03-31 11:30:20.848318 I | embed: rejected connection from "10.211.55.25:51382" (error "tls: first record does not look like a TLS handshake", ServerName "")
2019-03-31 11:30:20.848383 I | embed: rejected connection from "10.211.55.25:51384" (error "tls: first record does not look like a TLS handshake", ServerName "")
The second half of the log looks much like the earlier output of systemctl status etcd.service. Digging into the earlier lines:
rafthttp: started HTTP pipelining with peer 3799f16725672c8e
etcdserver/membership: added member 666302b2b897a609 [http://10.211.55.26:2380] to cluster 66898ae5243d3dda
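Lines like these can be extracted from a captured startup log so that the peer URLs etcd loaded can be compared with the https:// URLs given on the command line. This is a rough sketch that assumes the log format above; stored_peer_urls is a hypothetical helper name.

```shell
# Read etcd startup log lines on stdin and print "<member-id> <peer-url>"
# for every "added member" entry.
stored_peer_urls() {
  sed -n -E 's/.*added member ([0-9a-f]+) \[([^]]+)\] to cluster.*/\1 \2/p'
}

# Typical use, with the manual run's output saved to a file (not run here):
#   stored_peer_urls < etcd-start.log
# Any line whose URL starts with http:// means that member is still
# recorded with a plaintext peer address.
```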
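HTTPS was clearly enabled, yet peer communication fell back to HTTP... When I hit this I searched Google for a long time, but similar reports were rare and nothing useful turned up. What finally helped was the thread at https://github.com/etcd-io/etcd/issues/10128: I tried deleting etcd's database and restarting etcd, and this time it came up successfully.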
$ export ETCDCTL_API=3
$ etcdctl --cacert=/etc/ssl/etcd/ca.pem --cert=/etc/ssl/etcd/cert.pem --key=/etc/ssl/etcd/key.pem --endpoints=https://10.211.55.24:2379,https://10.211.55.25:2379,https://10.211.55.26:2379 endpoint health
https://10.211.55.24:2379 is healthy: successfully committed proposal: took = 1.958906ms
https://10.211.55.26:2379 is healthy: successfully committed proposal: took = 2.75776ms
https://10.211.55.25:2379 is healthy: successfully committed proposal: took = 1.809552ms
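While the cluster was running over HTTP it left some "dirty data" in the database, and etcd evidently read that data back on startup: the log above says "the server is already initialized as member before" and then adds every member with its old http:// peer URL. Normally you deploy an etcd cluster straight onto fresh machines with no leftover data, so this problem is rare; but when you do hit it, it is painful and tracking it down wastes a lot of time.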
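The cleanup itself can be sketched as follows. Two assumptions are mine rather than the original fix: the "database" is taken to be the --data-dir shown earlier (/var/lib/etcd), and the directory is moved aside instead of deleted so the old state can still be inspected. The function name move_aside_data_dir is hypothetical. Either way, whatever keys the HTTP cluster held are gone from the new cluster, so this only suits a cluster whose data can be thrown away.

```shell
# Move an etcd data directory aside instead of deleting it outright.
# Prints the backup path on success, returns non-zero on failure.
move_aside_data_dir() {
  data_dir=$1
  backup="${data_dir}.bak.$(date +%Y%m%d%H%M%S)"
  if [ ! -d "$data_dir" ]; then
    echo "no such directory: $data_dir" >&2
    return 1
  fi
  mv "$data_dir" "$backup" && echo "$backup"
}

# On every member, as root (not run here):
#   systemctl stop etcd.service
#   move_aside_data_dir /var/lib/etcd
#   systemctl start etcd.service
```

Because all three members then start with an empty data directory and --initial-cluster-state new, they should bootstrap a fresh cluster from the https:// URLs on the command line.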