There is a very common misunderstanding that the leader election algorithm in zookeeper is paxos or fast paxos. The leader election algorithm is not paxos or fast paxos, please consider the following facts:
There is no the concept of proposal number in the leader election in zookeeper. the proposal number is a key concept to paxos. Some one think the epoch is the proposal number, but different followers may produce proposal with the same epoch which is not allowed in paxos.
Fast paxos requires at least 3t + 1 acceptors, where t is the number of servers which are allowed to fail [3]. This is conflict with the fact that a zookeeper cluster with 3 servers works well even if one server failed.
The leader election algorithm must make sure P1. However paxos does provide such guarantee.
In paxos, a leader is also required to achieve progress. There are some similarities between the leader in paxos and the leader in zookeeper. Even if more than one servers believe they are the leader, the consistency is preserved both in zookeeper and in paxos. this is very clearly discussed in [1] and [2].
然后是作者三次对比
1)邮件列表
Our protocol instead, has only two phases, just like a two-phase
commit protocol. Of course, for Paxos, we can ignore the first phase in runs in
which we have a single proposer as we can run phase 1 for multiple instances at
a time, which is what Ben called previously Multi-Paxos, I believe. The trick
with skipping phase 1 is to deal with leader switching.
2)出书的访谈
We made a few interesting observations about Paxos when contrasting it to Zab, like problems you could run into if you just implemented Paxos alone. Not that Paxos is broken or anything, just that in our setting, there were some properties it was not giving us. Some people still like to map Zab to Paxos, and they are not completely off, but the way we see it, Zab matches a service like ZooKeeper well.
zk的分布式一致性算法有了个名称叫Zab
3)论文
We use an algorithm that shares some of the character- istics of Paxos, but that combines transaction logging needed for consensus with write-ahead logging needed for data tree recovery to enable an efficient implementa- tion.
there can be at most one leader (proposer) at any time, and we guarantee this by making sure
that a quorum of replicas recognize the leader as a leader by committing to an
epoch change. This change in epoch also allows us to get unique zxids since the
epoch forms part of the zxid.
参考文献
[1] Reed, B., & Junqueira, F. P. (2008). A simple totally ordered broadcast protocol. Second
Workshop on Large-Scale Distributed Systems and Middleware (LADIS 2008). Yorktown
Heights, NY: ACM. ISBN: 978-1-60558-296-2.
[2] Lamport, L. Paxos made simple. ACM SIGACT News 32, 4 (Dec. 2001), 1825.
[3] F. Junqueira, Y. Mao, and K. Marzullo. Classic paxos vs. fast paxos: caveat emptor. In
Proceedings of the 3rd USENIX/IEEE/IFIP Workshop on Hot Topics in System Dependability
(HotDep.07). Citeseer, 2007.
[4]O'Reilly.ZooKeeper.Distributed process coordination.2013
[5] http://agapple.iteye.com/blog/1184023 zookeeper项目使用几点小结