Issue
A few days ago I was upgrading a Kubernetes control plane managed by ClusterAPI that only had one node (this client didn’t have enough hardware for HA at the moment). I was using Mikroways’ latest vSphere CAPI Helm chart and I accidentally added a wrong value in the kube-apiserver extra args section.
Specifically, in this value: .kubeadm.clusterConfiguration.apiServer.extraArgs
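To give an idea of what ended up deployed, one quick way to double-check that value is to dump the user-supplied values of the release and filter for that path. This is just a sketch: it assumes the Helm release is named after the cluster and that yq is available.
# Show the user-supplied values of the release, filtered to the offending path
helm get values <cluster-name> -o yaml | yq '.kubeadm.clusterConfiguration.apiServer.extraArgs'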
The thing was that after applying the changes, a new node was created but it didn’t register correctly: kube-apiserver couldn’t start because of the incorrect values in its arguments, so the new node was stuck in that state.
I was able to confirm that by connecting to the node and checking the restarted containers:
# Connect to the new control plane node
ssh <new-control-plane-node>
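# List all containers (including exited ones) to spot the one that keeps restarting
crictl ps -a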
# Check logs of restarted container
crictl logs <restarted-container>
After checking the kube-apiserver container, I observed an error message similar to this one:
Error: oidc-client-id is missing
A manual procedure was needed to unblock this control plane node.
So I thought that maybe, if I upgraded the Helm release with the correct values and deleted the CAPI Machine, a new node would register to the cluster and the upgrade would finish successfully.
# Upgrade the release
helm upgrade <cluster-name> <vsphere-chart> -f values-<cluster-name>.yaml
# Delete the stuck machine
kubectl delete machine <stuck-control-plane-node>
# Check the state of the cluster
clusterctl describe cluster <cluster-name>
After I did that, the stuck node was deleted but I lost access to the cluster.
Debugging
First, I had to check what was going on in the remaining control plane node.
# Connect to the control plane node
ssh <control-plane-node>
# Check for issues in kubelet
systemctl status kubelet
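# If the status alone is not conclusive, the recent kubelet logs usually are
journalctl -u kubelet --no-pager --since "10 minutes ago"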
After that I observed that kubelet was working as intended, so I had to look at the containers.
# List all containers, looking for ones that keep restarting
crictl ps -a
# Check the logs of a restarted container
crictl logs <restarted-container>
I noticed that the kube-apiserver container was restarting with the following error in its logs:
Error: context deadline exceeded
I also saw that the etcd container was restarting with the same error. In addition, its logs showed a timeout error related to the IP of the deleted node:
Error: timeout <old-node-ip>
Apparently the deleted node had managed to register itself as an etcd member, and the healthy node couldn’t refresh the member list afterwards. The etcd cluster was unhealthy because it kept trying to reach a member that no longer existed; with one of its two members gone it had lost quorum, which is why the API server went down too.
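One way to confirm that is to ask etcd directly for its member list from the surviving node. A sketch, assuming the standard kubeadm certificate paths and that the etcd container stays up long enough to exec into it:
# Find the etcd container ID
crictl ps -a | grep etcd
# Query the member list from inside the etcd container
crictl exec <etcd-container> etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  member list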
Solution
After a bit of research I found, in the official etcd documentation about disaster recovery, a flag called --force-new-cluster that basically forces the recreation of the cluster: it drops all the existing members from the configuration and re-adds the local member as the only one.
What I had to do was add this flag to the etcd container. So I connected to the node and remembered that Kubernetes static pod manifests are loaded from the folder /etc/kubernetes/manifests/, and that there is one .yaml in particular for etcd (/etc/kubernetes/manifests/etcd.yaml). The next step was to add the flag --force-new-cluster to its container arguments and restart kubelet.
# Connect to the node
ssh <control-plane-node>
# Edit the etcd manifest to add --force-new-cluster
vi /etc/kubernetes/manifests/etcd.yaml
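# With a standard kubeadm layout the flag goes in the etcd container's
# command list (spec.containers[0].command), e.g. appending:
#   - --force-new-cluster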
# Restart kubelet
systemctl restart kubelet
This caused the etcd cluster to start again. At last the Kubernetes cluster was accessible and the upgrade with the corrected values finished successfully.
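For completeness, a couple of quick checks like these confirm the recovery. Also, the etcd documentation recommends removing --force-new-cluster from the manifest once the member is healthy again, so it doesn’t reset the membership on a future restart:
# Confirm the API server answers again
kubectl get nodes
# Confirm the new control plane machine registers and the upgrade proceeds
clusterctl describe cluster <cluster-name>
# Remove --force-new-cluster from etcd.yaml once everything is healthy
vi /etc/kubernetes/manifests/etcd.yaml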
Conclusion
Don’t surrender!