Issue

A few days ago I was upgrading a Kubernetes control plane managed by Cluster API that only had one node (this client didn’t have enough hardware for HA at the moment). I was using Mikroways’ latest vSphere CAPI Helm chart and I accidentally added a wrong value in the kube-apiserver extra args section.

Specifically in this value: .kubeadm.clusterConfiguration.apiServer.extraArgs

The thing was that, after applying the changes, a new node was created but it never registered correctly: kube-apiserver refused to start because of the invalid arguments, so the new control plane node was stuck in that state.
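
From the management cluster, this typically shows up as a Machine that never becomes Ready. A quick way to see it (the namespace placeholder is an assumption; adjust it to wherever the cluster’s CAPI objects live):

# List the CAPI machines and look for the one stuck in a non-Ready state
kubectl get machines -n <cluster-namespace>
# Get a tree view of the cluster and its control plane
clusterctl describe cluster <cluster-name>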

I was able to confirm that by connecting to the node and checking the restarting containers:

# Connect to the new control plane node
ssh <new-control-plane-node>
# List all containers to find the ones that keep restarting
crictl ps -a
# Check the logs of a restarting container
crictl logs <restarted-container>

After checking the kube-apiserver container, I found an error message similar to this:

Error: oidc-client-id is missing
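
Judging by that message, the broken value was most likely an OIDC-related flag added under .kubeadm.clusterConfiguration.apiServer.extraArgs without its companion setting. Purely as a hypothetical illustration (the real change was made in the values file, and the exact key layout depends on the chart version; names and URL here are made up), something like this would reproduce that kind of failure, since kube-apiserver with the classic OIDC flags refuses to start when oidc-issuer-url is set without oidc-client-id:

# Hypothetical: an OIDC issuer configured without its client id
helm upgrade <cluster-name> <vsphere-chart> -f values-<cluster-name>.yaml \
  --set kubeadm.clusterConfiguration.apiServer.extraArgs.oidc-issuer-url=https://sso.example.com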

A manual procedure was needed to unblock this control plane node.

So, I thought that maybe if I upgraded the Helm release with the correct values and deleted the CAPI Machine, a new node would register with the cluster and the upgrade would finish successfully.

# Upgrade the release with the corrected values
helm upgrade <cluster-name> <vsphere-chart> -f values-<cluster-name>.yaml
# Delete the stuck machine
kubectl delete machine <stuck-control-plane-node>
# Check the state of the cluster
clusterctl describe cluster <cluster-name>

After I did that, the stuck node was deleted but I lost access to the cluster.

Debugging

First, I had to check what was going on on the remaining control plane node.

# Connect to the control plane node
ssh <control-plane-node>
# Check for issues in kubelet
systemctl status kubelet
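
If the status output is not enough, the kubelet logs can be tailed as well (assuming a systemd-based node):

# Show the last kubelet log lines from the journal
journalctl -u kubelet --no-pager -n 100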

I observed that kubelet was working as intended, so the next step was to look at the containers.

# Look for restarting containers
crictl ps -a
# Check the logs of a restarting container
crictl logs <restarted-container>

I noticed that the kube-apiserver container was restarting with the following error in its logs:

Error: context deadline exceeded

I also saw that the etcd container was restarting with the same error, and its logs contained a timeout error related to the IP of the old node:

Error: timeout <old-node-ip>

Apparently the new node had managed to register itself as an etcd member before being deleted, and the surviving node couldn’t refresh the member list. etcd was unhealthy because it kept trying to reach a member that no longer existed, and with one of its two members gone it had lost quorum.
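
One way to confirm the stale member is to query etcd directly from inside its container. This is a sketch assuming the standard kubeadm certificate paths, and it may itself time out while quorum is lost, but when it answers it shows the member that no longer exists:

# Find the etcd container id
crictl ps -a | grep etcd
# List the etcd members as seen by the surviving node
crictl exec <etcd-container> etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  member list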

Solution

After a bit of research I found, in the official etcd documentation on disaster recovery, a flag called --force-new-cluster that forces etcd to start a new single-member cluster from its existing data directory, removing all current members and re-adding only itself. What I had to do was add this flag to the etcd container. I connected to the node and remembered that Kubernetes loads static pod manifests from /etc/kubernetes/manifests/, and that there is one YAML in particular for etcd (/etc/kubernetes/manifests/etcd.yaml). The next step was to add the --force-new-cluster flag to its container arguments and restart kubelet.

# Connect to the node
ssh <control-plane-node>
# Edit /etc/kubernetes/manifests/etcd.yaml to add --force-new-cluster
vi /etc/kubernetes/manifests/etcd.yaml
# Restart kubelet so it recreates the etcd static pod
systemctl restart kubelet
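
After kubelet recreates the static pod, it is worth verifying that etcd stays up and reports healthy again; a sketch, again assuming the standard kubeadm certificate paths:

# Check that the etcd container is running and no longer restarting
crictl ps | grep etcd
# Ask etcd whether the (now single-member) cluster is healthy
crictl exec <etcd-container> etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint health

Keep in mind that the etcd documentation recommends removing --force-new-cluster from the manifest once the cluster is healthy again, since the flag overrides the membership configuration on every start.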

This brought the etcd cluster back up. The Kubernetes cluster was finally accessible again, and the upgrade with the corrected values completed successfully.
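
A quick final check, using the workload cluster’s kubeconfig for kubectl and the management cluster for clusterctl:

# Confirm that the API server answers again
kubectl get nodes
# Confirm that the CAPI upgrade finished
clusterctl describe cluster <cluster-name>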

Conclusion

Don’t surrender!