Issue

A few days ago I was upgrading a Kubernetes control plane managed by Cluster API that only had one node (this client didn’t have enough hardware for HA at the moment). I was using Mikroways’ latest vSphere CAPI Helm chart and I accidentally added a wrong value in the kube-apiserver extra args section.

Specifically in this value: .kubeadm.clusterConfiguration.apiServer.extraArgs

The thing was that, after applying the changes, a new node was created but it never registered correctly: kube-apiserver refused to start because of the invalid arguments, so the new control plane node was stuck in that state.
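
From the management cluster, this typically shows up as a Machine that never becomes Ready. A quick way to see it (the namespace placeholder is an assumption; adjust it to wherever the cluster’s CAPI objects live):

# List the CAPI machines and look for the one stuck in a non-Ready state
kubectl get machines -n <cluster-namespace>
# Get a tree view of the cluster and its control plane
clusterctl describe cluster <cluster-name>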

I was able to confirm that by connecting to the node and checking the restarting containers:

# Connect to the new control plane node
ssh <new-control-plane-node>
# List all containers to find the ones that keep restarting
crictl ps -a
# Check the logs of a restarting container
crictl logs <restarted-container>

After checking the kube-apiserver container, I found an error message similar to this:

Error: oidc-client-id is missing
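
Judging by that message, the broken value was most likely an OIDC-related flag added under .kubeadm.clusterConfiguration.apiServer.extraArgs without its companion setting. Purely as a hypothetical illustration (the real change was made in the values file, and the exact key layout depends on the chart version; names and URL here are made up), something like this would reproduce that kind of failure, since kube-apiserver with the classic OIDC flags refuses to start when oidc-issuer-url is set without oidc-client-id:

# Hypothetical: an OIDC issuer configured without its client id
helm upgrade <cluster-name> <vsphere-chart> -f values-<cluster-name>.yaml \
  --set kubeadm.clusterConfiguration.apiServer.extraArgs.oidc-issuer-url=https://sso.example.com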

A manual procedure was needed to unblock this control plane node.

So, I thought that maybe if I upgraded the Helm release with the correct values and deleted the CAPI Machine, a new node would register with the cluster and the upgrade would finish successfully.

# Upgrade the release with the corrected values
helm upgrade <cluster-name> <vsphere-chart> -f values-<cluster-name>.yaml
# Delete the stuck machine
kubectl delete machine <stuck-control-plane-node>
# Check the state of the cluster
clusterctl describe cluster <cluster-name>

After I did that, the stuck node was deleted but I lost access to the cluster.

Debugging

First, I had to check what was going on on the remaining control plane node.

# Connect to the control plane node
ssh <control-plane-node>
# Check for issues in kubelet
systemctl status kubelet
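
If the status output is not enough, the kubelet logs can be tailed as well (assuming a systemd-based node):

# Show the last kubelet log lines from the journal
journalctl -u kubelet --no-pager -n 100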

I observed that kubelet was working as intended, so the next step was to look at the containers.

# Look for restarting containers
crictl ps -a
# Check the logs of a restarting container
crictl logs <restarted-container>

I noticed that the kube-apiserver container was restarting with the following error in its logs:

Error: context deadline exceeded

I also saw that the etcd container was restarting with the same error, and its logs contained a timeout error related to the IP of the old node:

Error: timeout <old-node-ip>

Apparently the new node had managed to register itself as an etcd member before being deleted, and the surviving node couldn’t refresh the member list. etcd was unhealthy because it kept trying to reach a member that no longer existed, and with one of its two members gone it had lost quorum.
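
One way to confirm the stale member is to query etcd directly from inside its container. This is a sketch assuming the standard kubeadm certificate paths, and it may itself time out while quorum is lost, but when it answers it shows the member that no longer exists:

# Find the etcd container id
crictl ps -a | grep etcd
# List the etcd members as seen by the surviving node
crictl exec <etcd-container> etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  member list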

Solution

After a bit of research I found, in the official etcd documentation on disaster recovery, a flag called --force-new-cluster that forces etcd to start a new single-member cluster from its existing data directory, removing all current members and re-adding only itself. What I had to do was add this flag to the etcd container. I connected to the node and remembered that Kubernetes loads static pod manifests from /etc/kubernetes/manifests/, and that there is one YAML in particular for etcd (/etc/kubernetes/manifests/etcd.yaml). The next step was to add the --force-new-cluster flag to its container arguments and restart kubelet.

# Connect to the node
ssh <control-plane-node>
# Edit /etc/kubernetes/manifests/etcd.yaml to add --force-new-cluster
vi /etc/kubernetes/manifests/etcd.yaml
# Restart kubelet so it recreates the etcd static pod
systemctl restart kubelet
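
After kubelet recreates the static pod, it is worth verifying that etcd stays up and reports healthy again; a sketch, again assuming the standard kubeadm certificate paths:

# Check that the etcd container is running and no longer restarting
crictl ps | grep etcd
# Ask etcd whether the (now single-member) cluster is healthy
crictl exec <etcd-container> etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint health

Keep in mind that the etcd documentation recommends removing --force-new-cluster from the manifest once the cluster is healthy again, since the flag overrides the membership configuration on every start.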

This brought the etcd cluster back up. The Kubernetes cluster was finally accessible again, and the upgrade with the corrected values completed successfully.
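
A quick final check, using the workload cluster’s kubeconfig for kubectl and the management cluster for clusterctl:

# Confirm that the API server answers again
kubectl get nodes
# Confirm that the CAPI upgrade finished
clusterctl describe cluster <cluster-name>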

Conclusion

Don’t surrender!