Our company's k8s cluster recently ran into a problem: every kubectl command failed with the error below. This post records how I traced the problem to its source and how I fixed it; I hope it helps:
```
The connection to the server 192.168.100.170:6443 was refused - did you specify the right host or port?
```

## Tracing the problem

Many of you have probably run into this error before. Port 6443 is the default port of the k8s APIServer, so a refused connection means either the APIServer itself is broken or a firewall is blocking it. First, let's check whether anything is still alive and listening on that port:
```
$ netstat -pnlt | grep 6443
```

The command prints nothing, which means the APIServer is not serving at all, so the next step is to look at its logs. As you may know, in a cluster set up with kubeadm the APIServer runs inside docker, so first we need to find the corresponding container. Remember the -a flag, because the container is probably no longer in the Running state:
```
$ docker ps -a | grep apiserver
f40d97ee2be6   40a63db91ef8   "kube-apiserver --au…"   2 minutes ago   Exited (255) 2 minutes ago   k8s_kube-apiserver_kube-apiserver-master1_kube-system_7beef975d93d634ecee05282d3d3a9ac_718
4b866fe71e33   registry.cn-hangzhou.aliyuncs.com/google_containers/pause:3.1   "/pause"   2 days ago   Up 2 days   k8s_POD_kube-apiserver-master1_kube-system_7beef975d93d634ecee05282d3d3a9ac_0
```

Two containers show up, and the kube-apiserver container is already in the Exited state. Note the pause container below it: it only bootstraps the pod that hosts the APIServer and is not the container actually running the service, so it has no useful logs; be careful not to pass the wrong container id. Now check the APIServer's logs:
```
$ docker logs -f f40d97ee2be6
I1230 01:39:42.942786       1 server.go:557] external host was not specified, using 192.168.100.171
I1230 01:39:42.942924       1 server.go:146] Version: v1.13.1
I1230 01:39:43.325424       1 plugins.go:158] Loaded 8 mutating admission controller(s) successfully in the following order: NamespaceLifecycle,LimitRanger,ServiceAccount,NodeRestriction,Priority,DefaultTolerationSeconds,DefaultStorageClass,MutatingAdmissionWebhook.
I1230 01:39:43.325451       1 plugins.go:161] Loaded 6 validating admission controller(s) successfully in the following order: LimitRanger,ServiceAccount,Priority,PersistentVolumeClaimResize,ValidatingAdmissionWebhook,ResourceQuota.
I1230 01:39:43.326327       1 plugins.go:158] Loaded 8 mutating admission controller(s) successfully in the following order: NamespaceLifecycle,LimitRanger,ServiceAccount,NodeRestriction,Priority,DefaultTolerationSeconds,DefaultStorageClass,MutatingAdmissionWebhook.
I1230 01:39:43.326340       1 plugins.go:161] Loaded 6 validating admission controller(s) successfully in the following order: LimitRanger,ServiceAccount,Priority,PersistentVolumeClaimResize,ValidatingAdmissionWebhook,ResourceQuota.
F1230 01:40:03.328865       1 storage_decorator.go:57] Unable to create storage backend: config (&{ /registry [https://127.0.0.1:2379] /etc/kubernetes/pki/apiserver-etcd-client.key /etc/kubernetes/pki/apiserver-etcd-client.crt /etc/kubernetes/pki/etcd/ca.crt true 0xc0004bd440 <nil> 5m0s 1m0s}), err (dial tcp 127.0.0.1:2379: connect: connection refused)
```

The last line, a fatal (F) entry, shows the problem: the APIServer failed to create its storage backend, so it could not start. Since k8s uses etcd as its storage, let's look at the etcd logs next.
Note: in my cluster etcd also runs inside docker. If you run etcd directly as a systemd service, check its logs with `systemctl status etcd` (or `journalctl -u etcd`) instead. The docker version looks like this:
```
# list the etcd containers; note that etcd also has a pause container
$ docker ps -a | grep etcd
1b8b522ee4e8   3cab8e1b9802   "etcd --advertise-cl…"   7 minutes ago   Exited (2) 6 minutes ago   k8s_etcd_etcd-master1_kube-system_1051dec0649f2b816946cb1fea184325_942
c9440543462e   registry.cn-hangzhou.aliyuncs.com/google_containers/pause:3.1   "/pause"   2 days ago   Up 2 days   k8s_POD_etcd-master1_kube-system_1051dec0649f2b816946cb1fea184325_0

# view the etcd logs
$ docker logs -f 1b8b522ee4e8
```
```
2019-12-30 01:43:44.075758 I | raft: 92b79bbe6bd2706a is starting a new election at term 165711
2019-12-30 01:43:44.075806 I | raft: 92b79bbe6bd2706a became candidate at term 165712
2019-12-30 01:43:44.075819 I | raft: 92b79bbe6bd2706a received MsgVoteResp from 92b79bbe6bd2706a at term 165712
2019-12-30 01:43:44.075832 I | raft: 92b79bbe6bd2706a [logterm: 82723, index: 84358879] sent MsgVote request to a25634eca298ea33 at term 165712
2019-12-30 01:43:44.075844 I | raft: 92b79bbe6bd2706a [logterm: 82723, index: 84358879] sent MsgVote request to 645060e8e879847c at term 165712
2019-12-30 01:43:45.075783 I | raft: 92b79bbe6bd2706a is starting a new election at term 165712
2019-12-30 01:43:45.075818 I | raft: 92b79bbe6bd2706a became candidate at term 165713
2019-12-30 01:43:45.075830 I | raft: 92b79bbe6bd2706a received MsgVoteResp from 92b79bbe6bd2706a at term 165713
2019-12-30 01:43:45.075840 I | raft: 92b79bbe6bd2706a [logterm: 82723, index: 84358879] sent MsgVote request to 645060e8e879847c at term 165713
2019-12-30 01:43:45.075849 I | raft: 92b79bbe6bd2706a [logterm: 82723, index: 84358879] sent MsgVote request to a25634eca298ea33 at term 165713
2019-12-30 01:43:45.928418 E | etcdserver: publish error: etcdserver: request timed out
2019-12-30 01:43:46.363974 I | etcdmain: rejected connection from "192.168.100.181:35914" (error "tls: failed to verify client's certificate: x509: certificate has expired or is not yet valid", ServerName "")
2019-12-30 01:43:46.364006 I | etcdmain: rejected connection from "192.168.100.181:35912" (error "tls: failed to verify client's certificate: x509: certificate has expired or is not yet valid", ServerName "")
2019-12-30 01:43:46.477058 I | etcdmain: rejected connection from "192.168.100.181:35946" (error "tls: failed to verify client's certificate: x509: certificate has expired or is not yet valid", ServerName "")
2019-12-30 01:43:46.483326 I | etcdmain: rejected connection from "192.168.100.181:35944" (error "tls: failed to verify client's certificate: x509: certificate has expired or is not yet valid", ServerName "")
2019-12-30 01:43:46.575790 I | raft: 92b79bbe6bd2706a is starting a new election at term 165713
2019-12-30 01:43:46.575818 I | raft: 92b79bbe6bd2706a became candidate at term 165714
2019-12-30 01:43:46.575829 I | raft: 92b79bbe6bd2706a received MsgVoteResp from 92b79bbe6bd2706a at term 165714
2019-12-30 01:43:46.575839 I | raft: 92b79bbe6bd2706a [logterm: 82723, index: 84358879] sent MsgVote request to 645060e8e879847c at term 165714
2019-12-30 01:43:46.575848 I | raft: 92b79bbe6bd2706a [logterm: 82723, index: 84358879] sent MsgVote request to a25634eca298ea33 at term 165714
2019-12-30 01:43:46.595828 I | etcdmain: rejected connection from "192.168.100.181:35962" (error "tls: failed to verify client's certificate: x509: certificate has expired or is not yet valid", ServerName "")
2019-12-30 01:43:46.597536 I | etcdmain: rejected connection from "192.168.100.181:35964" (error "tls: failed to verify client's certificate: x509: certificate has expired or is not yet valid", ServerName "")
2019-12-30 01:43:46.709028 I | etcdmain: rejected connection from "192.168.100.181:35970" (error "tls: failed to verify client's certificate: x509: certificate has expired or is not yet valid", ServerName "")
2019-12-30 01:43:46.714243 I | etcdmain: rejected connection from "192.168.100.181:35972" (error "tls: failed to verify client's certificate: x509: certificate has expired or is not yet valid", ServerName "")
2019-12-30 01:43:46.928411 W | rafthttp: health check for peer a25634eca298ea33 could not connect: dial tcp 192.168.100.191:2380: getsockopt: connection refused
...
```

You can see etcd looping through these errors until it finally times out and exits. The key error buried in there is `tls: failed to verify client's certificate: x509: certificate has expired or is not yet valid`. Anyone who maintains k8s clusters regularly will find this familiar: a certificate has expired again.
This cluster has three masters: 171, 181 and 191. The error messages above show that connections coming from 181 fail certificate verification, so let's log in to 181 and confirm:
```
# enter the k8s certificate directory
$ cd /etc/kubernetes/pki

# check the certificate's validity period
$ openssl x509 -in etcd/server.crt -noout -text | grep ' Not '
            Not Before: Dec 26 08:12:11 2018 GMT
            Not After : Dec 26 08:12:11 2019 GMT
```

After going through them all, the k8s certificates themselves were fine, but every etcd certificate had expired. For an overview of which certificates k8s needs, see the article below; after that, let's fix the problem.
Kubeadm安裝的K8S集群1年證書過期問題的解決思路
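Rather than inspecting each file by hand, you can loop over every certificate under the PKI directory and print its expiry date. This is a sketch of my own, not part of the original procedure; the `check_certs` helper and the `PKI_DIR` variable are names I made up:

```shell
#!/bin/sh
# Print the expiry date of every certificate under the kubeadm PKI directory.
# PKI_DIR can be overridden for clusters with a non-standard layout.
PKI_DIR="${PKI_DIR:-/etc/kubernetes/pki}"

check_certs() {
  dir="$1"
  for crt in "$dir"/*.crt "$dir"/etcd/*.crt; do
    [ -f "$crt" ] || continue                 # skip unmatched globs
    printf '%s: ' "$crt"
    openssl x509 -in "$crt" -noout -enddate   # prints "notAfter=<date>"
  done
}

check_certs "$PKI_DIR"
```

Run on a control-plane node, this prints one `notAfter=` line per certificate, so an expired etcd certificate stands out immediately.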
## Fixing the problem

Note: because of k8s version differences, this part may not match your setup exactly. The versions I used are:
```
root@master1:~# kubelet --version
Kubernetes v1.13.1
root@master1:~# kubeadm version
kubeadm version: &version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.1", GitCommit:"eec55b9ba98609a46fee712359c7b5b365bdd920", GitTreeState:"clean", BuildDate:"2018-12-13T10:36:44Z", GoVersion:"go1.11.2", Compiler:"gc", Platform:"linux/amd64"}
```

If your version differs a lot, search for a solution that matches it; there are quite a few out there. Run the commands below together with -h first to understand what they do. Warning: the following operations will stop services, so execute them with care.
### Back up the original files

```
$ cd /etc
$ cp -r kubernetes kubernetes.bak
```

### Regenerate the certificates

Regenerating the certificates requires the configuration file used when the cluster was initialized. My kubeadm.yaml looks like this:
```yaml
kind: ClusterConfiguration
apiVersion: kubeadm.k8s.io/v1beta1
controlPlaneEndpoint: "192.168.100.170:6443"
apiServer:
  certSANs:
  - master1
  - master2
  - master3
  - 192.168.100.170
  - 192.168.100.171
  - 192.168.100.181
  - 192.168.100.191
```

Here 192.168.100.170 is the VIP, and 171, 181 and 191 are master1, master2 and master3 respectively. Now re-issue the certificates with this file; this has to be run on every control-plane node:
```
$ kubeadm init phase certs all --config=kubeadm.yaml
```

### Regenerate the kubeconfig files

```
$ kubeadm init phase kubeconfig all --config kubeadm.yaml
```

This command also needs to be run once on every control-plane node. The regenerated files are the kubeconfigs under /etc/kubernetes: admin.conf, kubelet.conf, controller-manager.conf and scheduler.conf.
### Restart k8s on the control-plane nodes

Restart the etcd, kube-apiserver, kube-controller-manager and kube-scheduler containers. In most cases kubectl will work again after this; remember to run `kubectl get nodes` to check the status of every node.
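On a docker-based kubeadm node, one way to do this restart is to find the control-plane containers by their `k8s_<component>` name prefix and restart them in bulk. This is only a sketch of that approach (moving the static-pod manifests out of `/etc/kubernetes/manifests` and back is another common trick), and the `restart_control_plane` helper is my own naming:

```shell
#!/bin/sh
# Restart the control-plane containers so they pick up the regenerated certs.
restart_control_plane() {
  for comp in etcd kube-apiserver kube-controller-manager kube-scheduler; do
    # -q prints container IDs only; xargs -r does nothing when none match
    docker ps --filter "name=k8s_${comp}" -q | xargs -r docker restart
  done
}

restart_control_plane
```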
### Regenerate the worker nodes' credentials

If the previous step still shows worker nodes as NotReady, their credentials have to be regenerated as well. This happens in particular when you also replaced the root CA, which invalidates the worker nodes' certificates too. Simply back up and remove the directory below, then restart the kubelet:
```
$ mv /var/lib/kubelet/pki /var/lib/kubelet/pki.bak
$ systemctl daemon-reload && systemctl restart kubelet
```

If that doesn't work, copy /etc/kubernetes/pki/ca.crt from a control-plane node to the same path on the worker node and restart the kubelet once more. After roughly three minutes the worker node's status, as seen from the control plane, should change to Ready.
## Summary

Limiting k8s certificates to one year is definitely a trap, even though the underlying intent of pushing users toward recent versions is good. If your cluster is currently healthy but you have never renewed its certificates, check their expiry dates now; once they have expired it is already too late.
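To catch this before it bites, a daily cron job can warn ahead of time. openssl's `-checkend` flag exits non-zero when a certificate will expire within the given number of seconds, which makes the check trivial. A minimal sketch, assuming a 30-day threshold; the `warn_if_expiring` helper and the chosen certificate paths are my own (newer kubeadm releases also ship a built-in expiration check, which v1.13 did not have):

```shell
#!/bin/sh
# Warn when a control-plane certificate is close to expiry.
# "openssl x509 -checkend N" exits non-zero if the cert expires within N seconds.
THRESHOLD_SECONDS=$(( 30 * 24 * 3600 ))   # 30 days

warn_if_expiring() {
  crt="$1"
  if ! openssl x509 -in "$crt" -noout -checkend "$THRESHOLD_SECONDS" >/dev/null; then
    echo "WARNING: $crt expires within 30 days"
  fi
}

for crt in /etc/kubernetes/pki/apiserver.crt /etc/kubernetes/pki/etcd/server.crt; do
  if [ -f "$crt" ]; then warn_if_expiring "$crt"; fi
done
```

Wire the script into cron (or a systemd timer) and route the WARNING lines to whatever alerting you already have.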
## References

- Kubeadm安裝的K8S集群1年證書過期問題的解決思路
- Troubleshooting kubectl Error: The connection to the server x.x.x.x:6443 was refused – did you specify the right host or port?