Challenge
A High Availability (HA) Acura deployment has been unexpectedly rebooted due to a power outage or accident, and the UI is no longer accessible (error 503).
All nodes have been shut down simultaneously. If only one node has been affected, please refer to this article.
Cause
Acura comprises a set of Docker containers managed by Kubernetes. Some of them, namely the databases (such as etcd), may crash and fail to restart correctly after an unexpected shutdown.
Solution
Establish an SSH connection to a node running the Acura instance to confirm that the issue is caused by affected etcd containers.
In the CLI, run the kubectl get pods command to see the status of all available pods in the cluster. A healthy deployment should have all pods in the Completed or Running state.
If any etcd containers are stuck in the Pending state, perform the following steps depending on your Acura version.
3.3
Execute these commands:
kubectl scale sts etcd --replicas=0
On each of the nodes, delete the etcd member ID file:
sudo rm /acura/etcd/member_id
kubectl scale sts etcd --replicas=3
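Optionally, you can watch the etcd pods being recreated while the StatefulSet scales back up (press Ctrl+C to stop watching):
kubectl get pods -w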
The following commands will help you check whether the operation has been successful:
kubectl exec -ti etcd-0 -- etcdctl member list
kubectl exec -ti etcd-0 -- etcdctl cluster-health
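A healthy three-member cluster should report output similar to the following; the member IDs and endpoint addresses below are placeholders and will differ in your environment:
member 8e9e05c52164694d is healthy: got healthy result from http://etcd-0.etcd:2379
member 91bc3c398fb3c146 is healthy: got healthy result from http://etcd-1.etcd:2379
member fd422379fda50e48 is healthy: got healthy result from http://etcd-2.etcd:2379
cluster is healthy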
3.4
Pick one node, preferably the one with the latest data, to perform the restore.
The etcd data has to be removed from all other nodes. For example, if you selected the first node for the restore, delete the data on all other nodes with this command:
sudo rm -rf /acura/etcd/*
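To double-check that the data has been removed, list the directory on each of those nodes; it should return nothing:
ls /acura/etcd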
Now you can execute the bundled restore script to handle the rest:
bash /acura/tools/restore_etcd.sh
After the script completes, there should be one etcd pod running. Kubernetes will then scale it to the required number of pods.
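To find the name of the running etcd pod for the verification commands below, you can filter the pod list:
kubectl get pods | grep etcd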
The following commands will help you check whether the operation has been successful:
kubectl exec -ti <etcd_pod_name> -- etcdctl member list
kubectl exec -ti <etcd_pod_name> -- etcdctl cluster-health
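As an illustration, etcdctl member list on a recovered cluster might print one line per member similar to the following; the member IDs, names, and URLs are placeholders:
8e9e05c52164694d: name=etcd-0 peerURLs=http://etcd-0.etcd:2380 clientURLs=http://etcd-0.etcd:2379 isLeader=true
91bc3c398fb3c146: name=etcd-1 peerURLs=http://etcd-1.etcd:2380 clientURLs=http://etcd-1.etcd:2379 isLeader=false
fd422379fda50e48: name=etcd-2 peerURLs=http://etcd-2.etcd:2380 clientURLs=http://etcd-2.etcd:2379 isLeader=false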