Adventures in Recovering Consul from Temporary Leadership Collision
This is an exercise in recovering from an honest but potentially hazardous mistake: briefly joining together two Consul clusters, each with its own leader. In this case, the newly elected leader had nothing in its catalog, so it decided the catalog was now empty, discarding the catalog from the previous leader.
That is not acceptable, so how do we repair it? One method is to restore the catalog from a backup, though depending on the nuances of your deployment this may have implications worth considering. Either way, I took this as an opportunity to push on Consul’s limits, to explore what was possible and what a recovery would look like (and, having been sorely disappointed by similar failure scenarios in etcd, I wanted to see if Consul would do better).
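(As an aside: the cluster in this post predates it, but Consul 0.7 and later ship a built-in snapshot tool, so on a newer cluster the same kind of fallback could look roughly like the following, where backup.snap is just an example filename.)
consul snapshot save backup.snap       # taken on a schedule, before the incident
consul snapshot restore backup.snap    # replay the saved state onto the current cluster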
Restoring an Older Snapshot
First we stop the leaders:
user@master:~# salt 'leaders*' cmd.run 'service consul stop'
leaders-i-ac20797e:
consul stop/waiting
leaders-i-e8a8190f:
consul stop/waiting
leaders-i-0d4a32b5:
consul stop/waiting
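Before touching anything on disk, it does not hurt to confirm that Consul is actually stopped everywhere; a quick check with the same Salt targeting (output elided here) would look something like:
user@master:~# salt 'leaders*' cmd.run 'service consul status'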
Next, we review the existing snapshots:
user@master:~# salt 'leaders*' cmd.run 'ls -alh /home/consul/tmp/raft/snapshots'
leaders-i-ac20797e:
total 16K
drwxr-x--- 4 consul consul 4.0K Mar 5 08:27 .
drwxr-x--- 3 consul consul 4.0K Oct 12 05:53 ..
drwxr-x--- 2 consul consul 4.0K Jan 22 20:56 221-155666-1453496179313
drwxr-x--- 2 consul consul 4.0K Mar 5 08:27 870-213018-1457166432286
leaders-i-e8a8190f:
total 16K
drwxr-x--- 4 consul consul 4.0K Mar 6 19:52 .
drwxr-x--- 3 consul consul 4.0K Oct 7 14:34 ..
drwxr-x--- 2 consul consul 4.0K Mar 2 11:10 870-213018-1456917007147
drwxr-x--- 2 consul consul 4.0K Mar 6 19:52 877-219344-1457293933762
leaders-i-0d4a32b5:
total 16K
drwxr-x--- 4 consul consul 4.0K Mar 2 11:14 .
drwxr-x--- 3 consul consul 4.0K Oct 7 14:34 ..
drwxr-x--- 2 consul consul 4.0K Feb 25 18:50 861-204830-1456426245436
drwxr-x--- 2 consul consul 4.0K Mar 2 11:14 870-213022-1456917280513
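The snapshot directory names appear to follow hashicorp/raft’s term-index-timestamp naming scheme, which makes picking a restore point easier. For example:
870-213018-1457166432286    # Raft term 870, log index 213018, millisecond timestamp (2016-03-05 08:27 UTC)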
We’re going to fall back to the snapshot from “Mar 5 08:27 870-213018-1457166432286”, so we back everything up and remove the newer “Mar 6 19:52 877-219344-1457293933762” one. Only leaders-i-e8a8190f actually holds that newer snapshot, so expect the mv to fail harmlessly on the other two nodes:
user@master:~# salt 'leaders*' cmd.run 'mkdir /root/consul-recovery && cp -rp /home/consul/tmp /root/consul-recovery/ && mv /home/consul/tmp/raft/snapshots/877-219344-1457293933762 /root/consul-recovery/'
leaders-i-ac20797e:
mv: cannot stat '/home/consul/tmp/raft/snapshots/877-219344-1457293933762': No such file or directory
leaders-i-0d4a32b5:
mv: cannot stat '/home/consul/tmp/raft/snapshots/877-219344-1457293933762': No such file or directory
leaders-i-e8a8190f:
user@master:~# salt 'leaders*' cmd.run 'ls -alh /home/consul/tmp/raft/snapshots'
leaders-i-ac20797e:
total 16K
drwxr-x--- 4 consul consul 4.0K Mar 5 08:27 .
drwxr-x--- 3 consul consul 4.0K Oct 12 05:53 ..
drwxr-x--- 2 consul consul 4.0K Jan 22 20:56 221-155666-1453496179313
drwxr-x--- 2 consul consul 4.0K Mar 5 08:27 870-213018-1457166432286
leaders-i-e8a8190f:
total 12K
drwxr-x--- 3 consul consul 4.0K Mar 6 19:57 .
drwxr-x--- 3 consul consul 4.0K Oct 7 14:34 ..
drwxr-x--- 2 consul consul 4.0K Mar 2 11:10 870-213018-1456917007147
leaders-i-0d4a32b5:
total 16K
drwxr-x--- 4 consul consul 4.0K Mar 2 11:14 .
drwxr-x--- 3 consul consul 4.0K Oct 7 14:34 ..
drwxr-x--- 2 consul consul 4.0K Feb 25 18:50 861-204830-1456426245436
drwxr-x--- 2 consul consul 4.0K Mar 2 11:14 870-213022-1456917280513
Great! Now let’s start them back up and see what happens. But first, let’s check our backup:
user@master:~# salt 'leaders*' cmd.run 'ls -alh /root/consul-recovery/'
leaders-i-ac20797e:
total 12K
drwxr-x--- 3 user root 4.0K Mar 6 19:56 .
drwx------ 5 user root 4.0K Mar 6 19:56 ..
drwxr-x--- 5 consul consul 4.0K Mar 5 08:26 tmp
leaders-i-0d4a32b5:
total 12K
drwxr-x--- 3 user root 4.0K Mar 6 19:57 .
drwx------ 6 user root 4.0K Mar 6 19:57 ..
drwxr-x--- 5 consul consul 4.0K Feb 18 08:58 tmp
leaders-i-e8a8190f:
total 16K
drwxr-x--- 4 user root 4.0K Mar 6 19:57 .
drwx------ 5 user root 4.0K Mar 6 19:57 ..
drwxr-x--- 2 consul consul 4.0K Mar 6 19:52 877-219344-1457293933762
drwxr-x--- 5 consul consul 4.0K Feb 18 08:55 tmp
user@master:~# salt 'leaders*' cmd.run 'ls -alh /root/consul-recovery/tmp/'
leaders-i-ac20797e:
total 20K
drwxr-x--- 5 consul consul 4.0K Mar 5 08:26 .
drwxr-x--- 3 user root 4.0K Mar 6 19:56 ..
drwxr-x--- 3 consul consul 4.0K Oct 12 05:53 raft
drwxr-x--- 2 consul consul 4.0K Oct 12 05:53 serf
drwxr-x--- 3 consul consul 4.0K Mar 5 08:27 tmp
leaders-i-e8a8190f:
total 20K
drwxr-x--- 5 consul consul 4.0K Feb 18 08:55 .
drwxr-x--- 4 user root 4.0K Mar 6 19:57 ..
drwxr-x--- 3 consul consul 4.0K Oct 7 14:34 raft
drwxr-x--- 2 consul consul 4.0K Oct 7 14:33 serf
drwxr-x--- 3 consul consul 4.0K Feb 18 08:55 tmp
leaders-i-0d4a32b5:
total 20K
drwxr-x--- 5 consul consul 4.0K Feb 18 08:58 .
drwxr-x--- 3 user root 4.0K Mar 6 19:57 ..
drwxr-x--- 3 consul consul 4.0K Oct 7 14:34 raft
drwxr-x--- 2 consul consul 4.0K Oct 7 14:33 serf
drwxr-x--- 3 consul consul 4.0K Feb 18 08:58 tmp
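As a cheap sanity check, we can diff the backup against the live data dirs; the only difference should be the 877-219344-1457293933762 snapshot we pruned on leaders-i-e8a8190f, which still exists in the backup copy:
user@master:~# salt 'leaders*' cmd.run 'diff -r /root/consul-recovery/tmp /home/consul/tmp'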
Next, we select one of these leaders to keep (here leaders-i-e8a8190f, whose newest remaining snapshot covers the same Raft term and index, 870-213018, as the one we chose above) and move the data dir aside on the others:
user@master:~# salt 'leaders*' cmd.run 'ls -alh /home/consul/tmp/raft/snapshots'
leaders-i-ac20797e:
total 16K
drwxr-x--- 4 consul consul 4.0K Mar 6 20:13 .
drwxr-x--- 3 consul consul 4.0K Oct 12 05:53 ..
drwxr-x--- 2 consul consul 4.0K Jan 22 20:56 221-155666-1453496179313
drwxr-x--- 2 consul consul 4.0K Mar 5 08:27 870-213018-1457166432286
leaders-i-e8a8190f:
total 16K
drwxr-x--- 4 consul consul 4.0K Mar 6 20:14 .
drwxr-x--- 3 consul consul 4.0K Mar 6 20:13 ..
drwxr-x--- 2 consul consul 4.0K Mar 2 11:10 870-213018-1456917007147
drwxr-x--- 2 user root 4.0K Mar 6 20:12 877-219344-1457293933762
leaders-i-0d4a32b5:
total 16K
drwxr-x--- 4 consul consul 4.0K Mar 6 20:14 .
drwxr-x--- 3 consul consul 4.0K Oct 7 14:34 ..
drwxr-x--- 2 consul consul 4.0K Feb 25 18:50 861-204830-1456426245436
drwxr-x--- 2 consul consul 4.0K Mar 2 11:14 870-213022-1456917280513
user@master:~# salt 'leaders-i-ac20797e' cmd.run 'mv /home/consul/tmp /home/consul/tmp-old'
leaders-i-ac20797e:
user@master:~# salt 'leaders-i-0d4a32b5' cmd.run 'mv /home/consul/tmp /home/consul/tmp-old'
leaders-i-0d4a32b5:
user@master:~# salt 'leaders*' cmd.run 'ls -alh /home/consul/tmp/raft/snapshots'
leaders-i-ac20797e:
ls: cannot access /home/consul/tmp/raft/snapshots: No such file or directory
leaders-i-e8a8190f:
total 16K
drwxr-x--- 4 consul consul 4.0K Mar 6 20:14 .
drwxr-x--- 3 consul consul 4.0K Mar 6 20:13 ..
drwxr-x--- 2 consul consul 4.0K Mar 2 11:10 870-213018-1456917007147
drwxr-x--- 2 user root 4.0K Mar 6 20:12 877-219344-1457293933762
leaders-i-0d4a32b5:
ls: cannot access /home/consul/tmp/raft/snapshots: No such file or directory
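Note that since we only renamed the data dirs rather than deleting them, this step is trivially reversible if the recovery goes wrong, e.g.:
user@master:~# salt 'leaders-i-ac20797e' cmd.run 'mv /home/consul/tmp-old /home/consul/tmp'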
Now, on each leader that still has a data dir, we need to edit the raft/peers.json file to ensure it contains the IPs of the servers we want in the quorum:
user@leaders-i-0d4a32b5:~# cat /home/consul/tmp/raft/peers.json
["10.10.20.8:8300","10.10.21.10:8300","10.10.20.14:8300"] < ADD 3 more from 10.10.22 and 10.10.23
echo '["10.10.20.8:8300","10.10.21.10:8300","10.10.20.14:8300"]' > /home/consul/tmp/raft/peers.json
Great, now we start the leaders:
user@master:~# salt 'leaders*' cmd.run 'service consul start'
leaders-i-ac20797e:
consul start/running, process 1038
leaders-i-e8a8190f:
consul start/running, process 13838
leaders-i-0d4a32b5:
consul start/running, process 13274
…and then we go to one of the leaders and watch the progress with consul monitor.
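A minimal invocation, run from a shell on any one of the servers (the log-level flag is optional), looks like:
user@leaders-i-e8a8190f:~# consul monitor -log-level=info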
Eventually we should see the leadership election happen; it will look something like this:
2016/03/06 20:49:34 [INFO] consul: adding server leaders-i-ac20797e (Addr: 10.10.20.14:8300) (DC: us-east-1)
2016/03/06 20:49:35 [WARN] raft: Election timeout reached, restarting election
2016/03/06 20:49:35 [INFO] raft: Node at 10.10.20.8:8300 [Candidate] entering Candidate state
2016/03/06 20:49:35 [ERR] raft: Failed to make RequestVote RPC to 10.10.21.10:8300: dial tcp 10.10.21.10:8300: connection refused
2016/03/06 20:49:35 [WARN] raft: Remote peer 10.10.20.14:8300 does not have local node 10.10.20.8:8300 as a peer
2016/03/06 20:49:35 [INFO] raft: Election won. Tally: 2
2016/03/06 20:49:35 [INFO] raft: Node at 10.10.20.8:8300 [Leader] entering Leader state
2016/03/06 20:49:35 [INFO] consul: cluster leadership acquired
2016/03/06 20:49:35 [INFO] consul: New leader elected: leaders-i-e8a8190f
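Once a leader is elected, it is worth confirming that the restored catalog actually came back. Assuming the default HTTP API port, a couple of quick checks:
user@leaders-i-e8a8190f:~# consul members
user@leaders-i-e8a8190f:~# curl http://localhost:8500/v1/catalog/services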
For more info, see https://www.consul.io/docs/guides/outage.html