How to restore an etcd pod after a node has returned to the cluster

Challenge

The etcd pod on the restored node is stuck in the “Pending” state, and the etcd cluster cannot rebuild itself.

Cause

If Acura is deployed as an HA cluster, it remains operable when one of the nodes goes down (is turned off or destroyed). However, once the lost node returns to the cluster, the clustered etcd needs a few manual actions before it can rebuild itself.

Solution

The following steps return the cluster from emergency mode to normal operation:

1. Connect to the restored node over SSH and delete the stale etcd data:

# sudo rm -rf /acura/etcd/*

2. Clear the stale claim reference (claimRef) from the etcd PV that is in the Released state, so that the PV can be bound again:

# kubectl patch pv $(kubectl get pv | grep etcd | awk '/Released/ {print $1}') --type merge --patch '{"spec": {"claimRef": null}}'

After that, the Etcd pod should start and join the cluster.
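
To confirm which volume step 2 affects, and that the patch took effect, the same filter can be wrapped in a small helper. This is a minimal sketch, not part of the Acura tooling: the function name released_etcd_pvs is hypothetical, and it assumes the default column layout of `kubectl get pv` output, where the PV name is the first column.

```shell
# Hypothetical helper: reads `kubectl get pv` output on stdin and prints
# the names of etcd PVs whose status is Released (same filter as step 2).
released_etcd_pvs() {
  grep etcd | awk '/Released/ {print $1}'
}

# Usage on a live cluster (requires kubectl access):
#   kubectl get pv | released_etcd_pvs             # before step 2: the PV that will be patched
#   kubectl get pv | released_etcd_pvs             # after step 2: expected to print nothing
#   kubectl get pods --all-namespaces | grep etcd  # the pod should leave the Pending state
```

If the helper prints more than one name before step 2, it may be safer to patch each PV individually rather than passing all of them in one command.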

