How to check and restore etcd after a full shutdown of an HA Acura deployment (error 503)

Challenge

A High Availability (HA) Acura deployment has been unexpectedly rebooted due to a power outage or an accident, and the UI is no longer accessible (error 503).

All nodes have been shut down simultaneously.

Cause

Acura comprises a set of Docker containers managed by Kubernetes. Some of them, namely databases, may crash and be unable to restart correctly after an unexpected shutdown.

Solution

Establish an SSH connection to a node running an Acura instance to confirm that the issue is caused by affected etcd containers.

In the CLI, run the kubectl get pods command to see the status of all pods in the cluster. In a healthy deployment, every pod is in the Completed or Running state.
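On a cluster with many pods, it can help to filter the listing down to the etcd pods that are not healthy. The following is a minimal sketch, not part of the product: the function name is hypothetical, and it assumes the default kubectl get pods column layout (NAME, READY, STATUS, RESTARTS, AGE) and that etcd pod names start with "etcd".

```shell
# Hypothetical helper: reads `kubectl get pods` output on stdin and prints the
# names of etcd pods whose STATUS is neither Running nor Completed.
# Assumes the default column layout, where STATUS is the third column.
unhealthy_etcd_pods() {
  awk 'NR > 1 && $1 ~ /^etcd/ && $3 != "Running" && $3 != "Completed" { print $1 }'
}

# Usage on a node with kubectl access:
#   kubectl get pods | unhealthy_etcd_pods
```

An empty result means that no etcd pod was found in a problematic state.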

If you find etcd containers in the "Pending" state, perform the following steps, depending on your Acura version.

3.3

Execute these commands:

kubectl scale sts etcd --replicas=0

sudo rm /acura/etcd/member_id    # run this on each of the nodes

kubectl scale sts etcd --replicas=3

The following commands will help check whether the operation has been successful:

kubectl exec -ti etcd-0 -- etcdctl member list

kubectl exec -ti etcd-0 -- etcdctl cluster-health
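If you want to use the health check in a script, the command output can be reduced to a pass/fail result. The sketch below is only an illustration: the function name is hypothetical, and it assumes that cluster-health ends with a summary line that reads "cluster is healthy" when all members respond, which should be verified against the etcdctl version shipped with your deployment.

```shell
# Hypothetical helper: reads `etcdctl cluster-health` output on stdin and
# succeeds only if a line starts with "cluster is healthy".
# The exact summary wording is an assumption about the etcdctl output.
etcd_cluster_is_healthy() {
  grep -q '^cluster is healthy'
}

# Usage (the -t flag is omitted so that no TTY is allocated while piping):
#   kubectl exec etcd-0 -- etcdctl cluster-health | etcd_cluster_is_healthy \
#     && echo "etcd is healthy" || echo "etcd still needs attention"
```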

3.4

Pick one node, preferably the one with the latest data, to perform the restore.

The etcd data has to be removed from all other nodes. For example, if you selected the first node for the restore, delete the data on each of the remaining nodes with this command:

sudo rm -rf /acura/etcd/*
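Because this command is destructive, it may be worth listing exactly where it will run before touching anything. The sketch below only prints the commands for review and does not execute them. The function name and the node names are hypothetical, and reaching the nodes over ssh by name is an assumption about your environment.

```shell
# Hypothetical helper: prints (does not run) the cleanup command for every
# node except the one chosen for the restore.
# Usage: print_etcd_cleanup_commands <node_to_keep> <node>...
print_etcd_cleanup_commands() {
  keep="$1"
  shift
  for node in "$@"; do
    if [ "$node" != "$keep" ]; then
      printf "ssh %s 'sudo rm -rf /acura/etcd/*'\n" "$node"
    fi
  done
}

# Example with placeholder node names; node1 is kept for the restore:
#   print_etcd_cleanup_commands node1 node1 node2 node3
```

Review the printed lines and run them manually only after confirming that the node kept for the restore is not in the list.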

Now run the pre-integrated restore script, which handles the rest of the procedure:

bash /acura/tools/restore_etcd.sh

After the script completes, one etcd pod should be running. Kubernetes will then scale etcd up to the required number of pods.

The following commands will help check whether the operation has been successful:

kubectl exec -ti <etcd_pod_name> -- etcdctl member list

kubectl exec -ti <etcd_pod_name> -- etcdctl cluster-health
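Once Kubernetes has finished scaling, the member list can also be compared against the number of members you expect. The following sketch is an illustration under assumptions: the function name is hypothetical, it assumes that member list prints one member per line, and the expected count of 3 is taken from the replica count used for version 3.3, so adjust it to your deployment.

```shell
# Hypothetical helper: reads `etcdctl member list` output on stdin and
# succeeds only if the number of non-empty lines equals the expected count.
# Usage: etcd_member_count_is <expected_count>
etcd_member_count_is() {
  expected="$1"
  actual=$(awk 'NF { n++ } END { print n + 0 }')
  if [ "$actual" -eq "$expected" ]; then
    echo "OK: $actual members"
  else
    echo "MISMATCH: expected $expected members, found $actual" >&2
    return 1
  fi
}

# Usage (the -t flag is omitted so that no TTY is allocated while piping):
#   kubectl exec <etcd_pod_name> -- etcdctl member list | etcd_member_count_is 3
```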

Related Articles
Why Acura UI is not available after the deployment
Replication fails with an error: "Failed to create backup on device. Internal library error. Failed build snapshot sequence"
Cloud Site creation fails due to a "Resource CREATE failed: invalid availability zone" error
Why a Windows VM starts on the Cloud Site on Azure with an error message
How to perform a connectivity check if a machine is shown offline in ACP
