9611:M 26 Nov 10:35:12.068 # Failover auth granted to efc1dfb19e120a2f41e0cec3d03d02f43e168238 for epoch 18
Redis source code:
// The master is already marked as failed, so this node, as its slave, can be promoted to master
void clusterHandleSlaveFailover(void) {
    ......
    // The previous failover attempt timed out, so schedule a new one
    if (auth_age > auth_retry_time) { // Two failover attempts must not be too close together
        // Update some timestamps
        ......
        redisLog(REDIS_WARNING,
            "Start of election delayed for %lld milliseconds "
            "(rank #%d, offset %lld).",
            server.cluster->failover_auth_time - mstime(),
            server.cluster->failover_auth_rank,
            replicationGetSlaveOffset());
        // Notify the other slaves
        /* Now that we have a scheduled election, broadcast our offset
         * to all the other slaves so that they'll updated their offsets
         * if our offset is better. */
        clusterBroadcastPong(CLUSTER_BROADCAST_LOCAL_SLAVES);
        return;
    }
    ......
    // Start voting
    /* Ask for votes if needed. */
    if (server.cluster->failover_auth_sent == 0) {
        server.cluster->currentEpoch++;
        server.cluster->failover_auth_epoch = server.cluster->currentEpoch;
        redisLog(REDIS_WARNING,"Starting a failover election for epoch %llu.",
            (unsigned long long) server.cluster->currentEpoch);
        clusterRequestFailoverAuth();
        server.cluster->failover_auth_sent = 1;
        clusterDoBeforeSleep(CLUSTER_TODO_SAVE_CONFIG|
                             CLUSTER_TODO_UPDATE_STATE|
                             CLUSTER_TODO_FSYNC_CONFIG);
        return; /* Wait for replies. */
    }
}
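The excerpt elides how the election delay is computed. As a minimal sketch, assuming the formula I recall from Redis 3.x cluster.c (a fixed 500 ms so the FAIL message can propagate, a random jitter of 0 to 499 ms, and an extra 1000 ms per rank), the delay could be modelled as below. The names election_delay_ms and jitter_ms are hypothetical and are not Redis identifiers.

```c
#include <assert.h>

/* Hypothetical helper, not a Redis function.
 * Assumed formula (recalled from Redis 3.x, not shown in the excerpt):
 *   delay = 500 ms fixed + jitter in [0, 500) ms + 1000 ms * rank
 * A slave with a better replication offset gets a lower rank and
 * therefore starts its election earlier. */
long long election_delay_ms(int rank, int jitter_ms) {
    assert(rank >= 0);
    assert(jitter_ms >= 0 && jitter_ms < 500);
    return 500LL + jitter_ms + 1000LL * rank;
}
```

Under this assumption, rank 0 with a jitter of 15 ms gives 515 ms, which is consistent with the "delayed for 515 milliseconds (rank #0 ...)" line logged by the slave. A rank 1 slave with the same jitter would wait 1515 ms.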
3173:S 26 Nov 10:35:11.003 # Start of election delayed for 515 milliseconds (rank #0, offset 21516181352).
3173:S 26 Nov 10:35:11.606 # Starting a failover election for epoch 18.
3173:S 26 Nov 10:35:11.660 # Failover election won: I'm the new master.
3173:S 26 Nov 10:35:11.660 # configEpoch set to 18 after successful failover
3022:M 26 Nov 10:35:15.560 # Cluster state changed: fail
3022:M 26 Nov 10:35:15.562 # Failover auth denied to efc1dfb19e120a2f41e0cec3d03d02f43e168238: its master is up
3022:M 26 Nov 10:35:15.595 # Configuration change detected. Reconfiguring myself as a replica of efc1dfb19e120a2f41e0cec3d03d02f43e168238
3022:S 26 Nov 10:35:15.595 # Connection with slave 10.10.79.150:6389 lost.
3022:S 26 Nov 10:35:15.616 # Cluster state changed: ok
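The "Failover auth denied ... its master is up" line comes from the checks a master runs before it votes. The following is a simplified sketch of those checks, not the Redis implementation: the types and the function decide_vote are hypothetical, and the real code applies further conditions (for example on slot configuration epochs and on how recently it voted for the same master) that are omitted here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef enum {
    VOTE_GRANTED,
    VOTE_DENIED_NOT_VOTER,     /* only masters serving slots may vote */
    VOTE_DENIED_STALE_EPOCH,   /* request epoch older than our currentEpoch */
    VOTE_DENIED_ALREADY_VOTED, /* one vote per epoch */
    VOTE_DENIED_MASTER_UP      /* requester's master is not seen as failed */
} vote_result;

typedef struct {               /* hypothetical view of the voting node */
    bool is_master;
    bool serves_slots;
    uint64_t current_epoch;
    uint64_t last_vote_epoch;
} voter_state;

typedef struct {               /* hypothetical view of the vote request */
    uint64_t request_epoch;
    bool master_known;         /* voter knows the requester's master */
    bool master_failed;        /* voter has that master flagged as FAIL */
    bool force_ack;            /* manual failover: skip the FAIL check */
} vote_request;

vote_result decide_vote(voter_state *v, const vote_request *r) {
    assert(v != NULL && r != NULL);
    if (!v->is_master || !v->serves_slots) return VOTE_DENIED_NOT_VOTER;
    if (r->request_epoch < v->current_epoch) return VOTE_DENIED_STALE_EPOCH;
    if (v->last_vote_epoch == v->current_epoch) return VOTE_DENIED_ALREADY_VOTED;
    if (!r->master_known || (!r->master_failed && !r->force_ack))
        return VOTE_DENIED_MASTER_UP;
    v->last_vote_epoch = v->current_epoch; /* remember the vote for this epoch */
    return VOTE_GRANTED;
}
```

Node 3022 is the old master itself, so it presumably never flagged itself as failed, which would explain why it answers the request with "its master is up" while the other masters grant the vote.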
.................
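(3) The new master and slave communicate successfully, and the master-slave relationship is established
5. The other nodes start to recognize the new slave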
7925:S 26 Nov 10:36:20.879 * Marking node b6073bedf256d45e1dce97cd9242bb4789d52343 as failing (quorum reached).
7925:S 26 Nov 10:35:33.603 * Clear FAIL state for node b6073bedf256d45e1dce97cd9242bb4789d52343: slave is reachable again.
6. Complete logs
(1) 10.10.81.94:7497 (master)
3022:M 26 Nov 10:35:15.560 # Cluster state changed: fail
3022:M 26 Nov 10:35:15.562 # Failover auth denied to efc1dfb19e120a2f41e0cec3d03d02f43e168238: its master is up
3022:M 26 Nov 10:35:15.595 # Configuration change detected. Reconfiguring myself as a replica of efc1dfb19e120a2f41e0cec3d03d02f43e168238
3022:S 26 Nov 10:35:15.595 # Connection with slave 10.10.79.150:6389 lost.
3022:S 26 Nov 10:35:15.616 # Cluster state changed: ok
3022:S 26 Nov 10:35:16.061 * Connecting to MASTER 10.10.79.150:6389
3022:S 26 Nov 10:35:16.061 * MASTER <-> SLAVE sync started
3022:S 26 Nov 10:35:16.062 * Non blocking connect for SYNC fired the event.
3022:S 26 Nov 10:35:16.062 * Master replied to PING, replication can continue...
3022:S 26 Nov 10:35:16.062 * Partial resynchronization not possible (no cached master)
3022:S 26 Nov 10:35:16.062 * Full resync from master: 80ac45a2b236fc44ebbef3d851e535bcbdd6367b:21516181353
3022:S 26 Nov 10:35:45.273 * MASTER <-> SLAVE sync: receiving 494097946 bytes from master
3022:S 26 Nov 10:35:51.052 * MASTER <-> SLAVE sync: Flushing old data
3022:S 26 Nov 10:36:22.165 * MASTER <-> SLAVE sync: Loading DB in memory
3022:S 26 Nov 10:37:44.355 * MASTER <-> SLAVE sync: Finished with success
3022:S 26 Nov 10:37:44.440 * Background append only file rewriting started by pid 4036
3022:S 26 Nov 10:38:34.988 * AOF rewrite child asks to stop sending diffs.
4036:C 26 Nov 10:38:34.988 * Parent agreed to stop sending diffs. Finalizing AOF...
4036:C 26 Nov 10:38:34.988 * Concatenating 0.02 MB of AOF diff received from parent.
4036:C 26 Nov 10:38:34.989 * SYNC append only file rewrite performed
4036:C 26 Nov 10:38:35.029 * AOF rewrite: 5 MB of memory used by copy-on-write
3022:S 26 Nov 10:38:35.127 * Background AOF rewrite terminated with success
3022:S 26 Nov 10:38:35.127 * Residual parent diff successfully flushed to the rewritten AOF (0.00 MB)
3022:S 26 Nov 10:38:35.128 * Background AOF rewrite finished successfully
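To read durations off a log like the one above, the timestamps can be converted to milliseconds and subtracted. This is a small sketch with hypothetical helper names (log_time_to_ms, log_elapsed_ms). It assumes both timestamps fall on the same day and always carry three millisecond digits, as they do in these logs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Convert "HH:MM:SS.mmm" to milliseconds since midnight.
 * Returns -1 if the string does not match or a field is out of range. */
long log_time_to_ms(const char *s) {
    int h, m, sec, ms;
    assert(s != NULL);
    if (sscanf(s, "%2d:%2d:%2d.%3d", &h, &m, &sec, &ms) != 4) return -1;
    if (h < 0 || h > 23 || m < 0 || m > 59 ||
        sec < 0 || sec > 59 || ms < 0 || ms > 999) return -1;
    return ((h * 60L + m) * 60L + sec) * 1000L + ms;
}

/* Store end - start (in ms) in *out. Returns 0 on success, -1 on bad input. */
int log_elapsed_ms(const char *start, const char *end, long *out) {
    long a = log_time_to_ms(start);
    long b = log_time_to_ms(end);
    assert(out != NULL);
    if (a < 0 || b < 0) return -1;
    *out = b - a;
    return 0;
}
```

For the old master above, "Full resync from master" is logged at 10:35:16.062 and "Finished with success" at 10:37:44.355, so log_elapsed_ms yields 148293 ms: the full resynchronization took roughly 148 seconds.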
(2) 10.10.79.150:6389 (slave)
3173:S 26 Nov 10:35:10.934 * FAIL message received from b95759d39b6714544917e4aefa383b4de80f871c about b6073bedf256d45e1dce97cd9242bb4789d52343
3173:S 26 Nov 10:35:11.003 # Start of election delayed for 515 milliseconds (rank #0, offset 21516181352).
3173:S 26 Nov 10:35:11.606 # Starting a failover election for epoch 18.
3173:S 26 Nov 10:35:11.660 # Failover election won: I'm the new master.
3173:S 26 Nov 10:35:11.660 # configEpoch set to 18 after successful failover
3173:M 26 Nov 10:35:11.660 # Connection with master lost.
3173:M 26 Nov 10:35:11.660 * Caching the disconnected master state.
3173:M 26 Nov 10:35:11.660 * Discarding previously cached master state.
3173:M 26 Nov 10:35:15.605 * Clear FAIL state for node b6073bedf256d45e1dce97cd9242bb4789d52343: master without slots is reachable again.
3173:M 26 Nov 10:35:16.071 * Slave 10.10.81.94:7497 asks for synchronization
3173:M 26 Nov 10:35:16.071 * Full resync requested by slave 10.10.81.94:7497
3173:M 26 Nov 10:35:16.071 * Starting BGSAVE for SYNC with target: disk
3173:M 26 Nov 10:35:16.188 * Background saving started by pid 23531
3173:M 26 Nov 10:35:18.091 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.
3173:M 26 Nov 10:35:31.056 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.
3173:M 26 Nov 10:35:40.040 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.
23531:C 26 Nov 10:35:45.048 * DB saved on disk
23531:C 26 Nov 10:35:45.113 * RDB: 6 MB of memory used by copy-on-write
3173:M 26 Nov 10:35:45.281 * Background saving terminated with success
3173:M 26 Nov 10:35:51.059 * Synchronization with slave 10.10.81.94:7497 succeeded
3173:M 26 Nov 10:36:11.107 * FAIL message received from 792a356d3cd1474b8f1374aba1e5ceced1c6868d about b6073bedf256d45e1dce97cd9242bb4789d52343
3173:M 26 Nov 10:36:22.277 * Clear FAIL state for node b6073bedf256d45e1dce97cd9242bb4789d52343: slave is reachable again.
(3) 10.10.81.95:7497 (master)
14795:M 26 Nov 10:35:10.309 * FAIL message received from b95759d39b6714544917e4aefa383b4de80f871c about b6073bedf256d45e1dce97cd9242bb4789d52343
14795:M 26 Nov 10:35:11.034 # Failover auth granted to efc1dfb19e120a2f41e0cec3d03d02f43e168238 for epoch 18
14795:M 26 Nov 10:35:14.979 * Clear FAIL state for node b6073bedf256d45e1dce97cd9242bb4789d52343: master without slots is reachable again.
14795:M 26 Nov 10:36:20.480 * Marking node b6073bedf256d45e1dce97cd9242bb4789d52343 as failing (quorum reached).
14795:M 26 Nov 10:36:31.652 * Clear FAIL state for node b6073bedf256d45e1dce97cd9242bb4789d52343: slave is reachable again.
(4) 10.10.79.157:6390 (slave)
29177:S 26 Nov 10:35:10.939 * FAIL message received from b95759d39b6714544917e4aefa383b4de80f871c about b6073bedf256d45e1dce97cd9242bb4789d52343
29177:S 26 Nov 10:35:15.609 * Clear FAIL state for node b6073bedf256d45e1dce97cd9242bb4789d52343: master without slots is reachable again.
29177:S 26 Nov 10:36:11.111 * FAIL message received from 792a356d3cd1474b8f1374aba1e5ceced1c6868d about b6073bedf256d45e1dce97cd9242bb4789d52343
29177:S 26 Nov 10:36:22.281 * Clear FAIL state for node b6073bedf256d45e1dce97cd9242bb4789d52343: slave is reachable again.
(5) 10.10.83.180:6382 (master)
9611:M 26 Nov 10:35:11.342 * FAIL message received from b95759d39b6714544917e4aefa383b4de80f871c about b6073bedf256d45e1dce97cd9242bb4789d52343
9611:M 26 Nov 10:35:12.068 # Failover auth granted to efc1dfb19e120a2f41e0cec3d03d02f43e168238 for epoch 18
9611:M 26 Nov 10:35:16.013 * Clear FAIL state for node b6073bedf256d45e1dce97cd9242bb4789d52343: master without slots is reachable again.
9611:M 26 Nov 10:36:11.516 * FAIL message received from 792a356d3cd1474b8f1374aba1e5ceced1c6868d about b6073bedf256d45e1dce97cd9242bb4789d52343
9611:M 26 Nov 10:36:22.686 * Clear FAIL state for node b6073bedf256d45e1dce97cd9242bb4789d52343: slave is reachable again.
(6) 10.10.81.96:6391 (slave)
7925:S 26 Nov 10:35:10.260 * FAIL message received from b95759d39b6714544917e4aefa383b4de80f871c about b6073bedf256d45e1dce97cd9242bb4789d52343
7925:S 26 Nov 10:35:14.930 * Clear FAIL state for node b6073bedf256d45e1dce97cd9242bb4789d52343: master without slots is reachable again.
7925:S 26 Nov 10:36:20.879 * Marking node b6073bedf256d45e1dce97cd9242bb4789d52343 as failing (quorum reached).
7925:S 26 Nov 10:35:33.603 * Clear FAIL state for node b6073bedf256d45e1dce97cd9242bb4789d52343: slave is reachable again.
(7) 10.10.81.97:7499 (master)
4051:M 26 Nov 10:35:09.768 * Marking node b6073bedf256d45e1dce97cd9242bb4789d52343 as failing (quorum reached).
4051:M 26 Nov 10:35:10.507 # Failover auth granted to efc1dfb19e120a2f41e0cec3d03d02f43e168238 for epoch 18
4051:M 26 Nov 10:35:14.452 * Clear FAIL state for node b6073bedf256d45e1dce97cd9242bb4789d52343: master without slots is reachable again.
4051:M 26 Nov 10:36:09.955 * FAIL message received from 792a356d3cd1474b8f1374aba1e5ceced1c6868d about b6073bedf256d45e1dce97cd9242bb4789d52343
4051:M 26 Nov 10:36:21.125 * Clear FAIL state for node b6073bedf256d45e1dce97cd9242bb4789d52343: slave is reachable again.
(8) 10.10.78.52:6398 (slave)
22972:S 26 Nov 10:35:10.944 * FAIL message received from b95759d39b6714544917e4aefa383b4de80f871c about b6073bedf256d45e1dce97cd9242bb4789d52343
22972:S 26 Nov 10:35:15.635 * Clear FAIL state for node b6073bedf256d45e1dce97cd9242bb4789d52343: master without slots is reachable again.
22972:S 26 Nov 10:36:11.116 * FAIL message received from 792a356d3cd1474b8f1374aba1e5ceced1c6868d about b6073bedf256d45e1dce97cd9242bb4789d52343
22972:S 26 Nov 10:36:22.286 * Clear FAIL state for node b6073bedf256d45e1dce97cd9242bb4789d52343: slave is reachable again.
(9) 10.10.81.98:7496 (master)
2062:M 26 Nov 10:35:10.908 * FAIL message received from b95759d39b6714544917e4aefa383b4de80f871c about b6073bedf256d45e1dce97cd9242bb4789d52343
2062:M 26 Nov 10:35:11.634 # Failover auth granted to efc1dfb19e120a2f41e0cec3d03d02f43e168238 for epoch 18
2062:M 26 Nov 10:35:15.599 * Clear FAIL state for node b6073bedf256d45e1dce97cd9242bb4789d52343: master without slots is reachable again.
2062:M 26 Nov 10:36:11.080 * FAIL message received from 792a356d3cd1474b8f1374aba1e5ceced1c6868d about b6073bedf256d45e1dce97cd9242bb4789d52343
2062:M 26 Nov 10:36:22.250 * Clear FAIL state for node b6073bedf256d45e1dce97cd9242bb4789d52343: slave is reachable again.
(10) 10.10.78.53:6396 (slave)
19527:S 26 Nov 10:35:10.944 * FAIL message received from b95759d39b6714544917e4aefa383b4de80f871c about b6073bedf256d45e1dce97cd9242bb4789d52343
19527:S 26 Nov 10:35:15.615 * Clear FAIL state for node b6073bedf256d45e1dce97cd9242bb4789d52343: master without slots is reachable again.
19527:S 26 Nov 10:36:11.117 * FAIL message received from 792a356d3cd1474b8f1374aba1e5ceced1c6868d about b6073bedf256d45e1dce97cd9242bb4789d52343
19527:S 26 Nov 10:36:22.287 * Clear FAIL state for node b6073bedf256d45e1dce97cd9242bb4789d52343: slave is reachable again.
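When comparing the per-node logs above, it helps to split each line into its fields (pid, role, date, time, level, message) and to pick out the vote grants. The sketch below does that with hypothetical names (log_entry, parse_log_line, is_vote_grant, failover_quorum). The majority rule in failover_quorum, masters / 2 + 1 counted over masters that serve slots, is what I understand Redis Cluster to use, so treat it as an assumption here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

typedef struct {
    int pid;
    char role;           /* M = master, S = slave, C = child process */
    int day;
    char month[4];       /* e.g. "Nov" */
    char time[13];       /* "HH:MM:SS.mmm" */
    char level;          /* '*' = notice, '#' = warning */
    const char *message; /* points into the original line */
} log_entry;

/* Parse one line such as
 *   "3173:S 26 Nov 10:35:11.606 # Starting a failover election for epoch 18."
 * Returns 0 on success, -1 if the line does not match the layout. */
int parse_log_line(const char *line, log_entry *out) {
    int consumed = 0;
    int n;
    assert(line != NULL && out != NULL);
    n = sscanf(line, "%d:%c %d %3s %12s %c %n",
               &out->pid, &out->role, &out->day,
               out->month, out->time, &out->level, &consumed);
    if (n != 6 || consumed <= 0) return -1;
    out->message = line + consumed; /* text after the level character */
    return 0;
}

/* True if the entry records a vote granted by this node. */
int is_vote_grant(const log_entry *e) {
    static const char prefix[] = "Failover auth granted to ";
    assert(e != NULL && e->message != NULL);
    return strncmp(e->message, prefix, sizeof(prefix) - 1) == 0;
}

/* Assumed majority rule: more than half of the masters serving slots. */
int failover_quorum(int masters_with_slots) {
    return masters_with_slots / 2 + 1;
}
```

In the logs above, four masters (pids 14795, 9611, 4051 and 2062) record "Failover auth granted" for epoch 18. If the five masters listed all serve slots, failover_quorum(5) is 3, so those grants are enough for the slave to win the election.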