Distributed consensus evolved from Paxos's mathematical rigor to Raft's design-for-understandability philosophy. Paxos, while theoretically elegant with its two-phase prepare/promise and accept/accepted protocol, has a notorious reputation for being difficult to understand and implement correctly. Raft explicitly prioritizes understandability by decomposing consensus into leader election, log replication, and safety properties, with a strong leader model that simplifies reasoning. Both algorithms tolerate (n-1)/2 failures and achieve equivalent theoretical guarantees, but Raft's prescriptive approach has led to more correct implementations and faster developer onboarding. Modern variants like Flexible Paxos and Multi-Raft address specific performance needs, while leaderless approaches like EPaxos remove bottlenecks. The key insight: an understandable algorithm that teams can implement correctly is more valuable than a theoretically pure but opaque one. The best consensus algorithm isn't determined by mathematical properties alone, but by the intersection of correctness, understandability, and practical implementation concerns.
The Art of Consensus: From Paxos to Raft and Beyond
In distributed systems, achieving agreement among multiple nodes is deceptively difficult. When networks partition, nodes fail, and messages get delayed or lost, how can we ensure all nodes agree on a single value? This is the consensus problem—one of the most fundamental challenges in distributed computing.
The journey from Paxos to Raft represents not just technical evolution, but a shift in how we think about algorithm design: Is it more important to be mathematically rigorous or to be understandable?
The Consensus Problem: Why Is It So Hard?
The FLP Impossibility Theorem
In 1985, Fischer, Lynch, and Paterson proved a shocking result: in an asynchronous distributed system, no deterministic algorithm can guarantee consensus if even a single node can crash.
This doesn't mean consensus is impossible in practice—it means we must make trade-offs:
- Synchrony assumptions: Bounded message delays
- Failure models: Only certain types of failures
- Liveness vs. Safety: Can't guarantee both always
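In practice, systems sidestep FLP by assuming partial synchrony: if a node stays silent past a timeout, we suspect it has crashed. Here is a minimal sketch of such a timeout-based failure detector (the class and timeout value are illustrative, not from any particular system):

```python
import time

class TimeoutFailureDetector:
    """Suspect a peer has failed if it stays silent past a bound.
    This trades FLP's pure asynchrony for a partial-synchrony assumption."""

    def __init__(self, timeout_seconds: float = 1.0):
        self.timeout = timeout_seconds
        self.last_seen: dict[str, float] = {}

    def heard_from(self, peer: str) -> None:
        # Record the time of the latest message from this peer
        self.last_seen[peer] = time.monotonic()

    def is_suspected(self, peer: str) -> bool:
        # A peer never heard from, or silent too long, is suspected
        last = self.last_seen.get(peer)
        return last is None or time.monotonic() - last > self.timeout
```

Such detectors can be wrong (a slow peer is not a dead peer), which is exactly why consensus algorithms must remain safe even when timing assumptions fail.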
The CAP Theorem
Another fundamental constraint: you can't have all three of:
- Consistency: All nodes see the same data
- Availability: Every request receives a response
- Partition tolerance: System continues despite network splits
Consensus algorithms choose CP (Consistency + Partition tolerance) over availability during partitions.
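To make that CP choice concrete, here is a hedged sketch of a quorum-based store that refuses writes when it cannot reach a majority, sacrificing availability rather than consistency (all names are illustrative):

```python
class QuorumStore:
    """CP behavior: reject requests when a majority is unreachable."""

    def __init__(self, cluster_size: int):
        # A write must be acknowledged by a majority to be durable
        self.majority = cluster_size // 2 + 1

    def write(self, key: str, value: str, reachable_nodes: int) -> bool:
        # On the minority side of a partition, no quorum can form,
        # so we fail the request instead of risking divergent data.
        if reachable_nodes < self.majority:
            raise RuntimeError("unavailable: cannot reach a majority")
        # ... replicate to a majority, then acknowledge ...
        return True
```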
Paxos: The Elegant Monster
History and Impact
Leslie Lamport published Paxos in 1998, although he had first submitted the paper ("The Part-Time Parliament") nearly a decade earlier. Despite its mathematical beauty, Paxos has a reputation for being notoriously difficult to understand and implement correctly.
Fun fact: Lamport originally presented it as a story about a legislative consensus system on the fictional Greek island of Paxos, which confused reviewers so much that the paper languished unpublished for years!
The Core Algorithm
Paxos operates in two phases:
Phase 1: Prepare/Promise
- Proposer sends PREPARE(n) with proposal number n
- Acceptors promise not to accept proposals < n
- Acceptors return any already-accepted values
Phase 2: Accept/Accepted
- Proposer sends ACCEPT(n, v) with value v
- Acceptors accept if they haven't promised to ignore n
- Once majority accepts, value is chosen
"""Proposal number: (round, server_id) for uniqueness and total ordering"""
:
:
return <
return ==
:
:
"""Paxos Acceptor node"""
=
# Highest proposal number seen in Phase 1
: = None
# Highest proposal actually accepted
: = None
"""Phase 1: Respond to PREPARE message"""
# If we've already promised a higher number, reject
return
# Promise not to accept lower-numbered proposals
=
return
"""Phase 2: Respond to ACCEPT message"""
# If we've promised not to accept this proposal number, reject
return
# Accept the proposal
=
=
return
"""Paxos Proposer node"""
=
=
= 0
= // 2 + 1
"""
Propose a value using Paxos algorithm.
Returns the chosen value (may differ from proposed value).
"""
# Phase 1: Prepare
+= 1
=
=
=
# Need majority of promises
return None
# Check if any acceptor already accepted a value
=
# Use the value from the highest-numbered accepted proposal
=
=
# Phase 2: Accept
=
=
=
# Need majority of accepts
return None
return
# Example usage
=
=
=
# Proposer 1 proposes "A"
=
# Proposer 2 tries to propose "B" but will get "A" (already chosen)
=
Multi-Paxos: Achieving Consensus on a Log
Basic Paxos achieves consensus on a single value. Real systems need consensus on a sequence of values (a log). Multi-Paxos optimizes this:
"""Simplified Multi-Paxos for log consensus"""
=
=
: = # Decided log entries
: = None
= False
"""Leader election (simplified)"""
# In real Multi-Paxos, this uses Phase 1 of Paxos
= True
=
"""Append a value to the log (only leader can do this)"""
return False
# Skip Phase 1 (already done during leader election)
# Go directly to Phase 2
=
=
# Send ACCEPT to all acceptors
= 0
=
+= 1
= // 2 + 1
return True
return False
Why Paxos Is Hard
The problems with Paxos:
- Conceptual complexity: Two-phase protocol with subtle edge cases
- Implementation difficulty: Gap between algorithm and real systems
- No clear leader: Multiple proposers can conflict
- Configuration changes: Adding/removing nodes is tricky
As Mike Burrows (of Google's Chubby) is often paraphrased: every working consensus protocol is essentially Paxos or a broken variant of it, and yet Paxos itself is too hard to understand.
Raft: Understandability as a Goal
Design Philosophy
In 2014, Diego Ongaro and John Ousterhout published Raft with an explicit goal: understandability. They decomposed consensus into:
- Leader election: One server becomes leader
- Log replication: Leader replicates entries to followers
- Safety: Ensure consistency guarantees
Leader Election
Raft servers can be in three states:
```python
import random
import time
from dataclasses import dataclass
from enum import Enum
from typing import Any, Dict, List, Optional


class State(Enum):
    FOLLOWER = "follower"
    CANDIDATE = "candidate"
    LEADER = "leader"


@dataclass
class LogEntry:
    term: int
    index: int
    command: Any


class RaftNode:
    """Raft consensus node"""

    def __init__(self, node_id: str, peers: List[str]):
        self.node_id = node_id
        self.peers = peers  # Other nodes

        # Persistent state
        self.current_term = 0
        self.voted_for: Optional[str] = None
        self.log: List[LogEntry] = []

        # Volatile state (0-based indexes; -1 means "none yet")
        self.state = State.FOLLOWER
        self.commit_index = -1
        self.last_applied = -1
        self.leader_id: Optional[str] = None

        # Leader state
        self.next_index: Dict[str, int] = {}
        self.match_index: Dict[str, int] = {}

        # Timing
        self.last_heartbeat = time.monotonic()
        self.election_timeout = self.random_election_timeout()

    def random_election_timeout(self) -> float:
        """Random election timeout between 150-300ms"""
        return random.uniform(0.150, 0.300)

    def tick(self) -> None:
        """Called periodically to check timeouts"""
        if self.state == State.LEADER:
            # Send heartbeats
            self.send_heartbeats()
        # Check election timeout
        elif time.monotonic() - self.last_heartbeat > self.election_timeout:
            self.start_election()

    def start_election(self) -> None:
        """Start leader election"""
        self.state = State.CANDIDATE
        self.current_term += 1
        self.voted_for = self.node_id
        self.last_heartbeat = time.monotonic()
        self.election_timeout = self.random_election_timeout()

        # Request votes from all peers
        votes = 1  # Vote for self
        for peer in self.peers:
            if self.request_vote_from(peer):
                votes += 1

        # Check if won election
        majority = (len(self.peers) + 1) // 2 + 1
        if votes >= majority:
            self.become_leader()

    def become_leader(self) -> None:
        """Transition to leader state"""
        self.state = State.LEADER
        # Initialize leader state
        for peer in self.peers:
            self.next_index[peer] = len(self.log)
            self.match_index[peer] = -1
        # Send initial heartbeats
        self.send_heartbeats()

    def handle_request_vote(self, term: int, candidate_id: str,
                            last_log_index: int, last_log_term: int):
        """Handle RequestVote RPC"""
        # If term is old, reject
        if term < self.current_term:
            return (self.current_term, False)

        # If term is newer, update and step down
        if term > self.current_term:
            self.current_term = term
            self.state = State.FOLLOWER
            self.voted_for = None

        # Check if we can vote for this candidate
        can_vote = (self.voted_for in (None, candidate_id)
                    and self.log_up_to_date(last_log_index, last_log_term))
        if can_vote:
            self.voted_for = candidate_id
            self.last_heartbeat = time.monotonic()
        return (self.current_term, can_vote)

    def log_up_to_date(self, last_log_index: int, last_log_term: int) -> bool:
        """Check if candidate's log is at least as up-to-date as ours"""
        if not self.log:
            return True
        our_term = self.log[-1].term
        our_index = len(self.log) - 1
        if last_log_term != our_term:
            return last_log_term > our_term
        return last_log_index >= our_index

    def send_heartbeats(self) -> None:
        """Leader sends periodic heartbeats (empty AppendEntries)"""
        ...  # Network transport elided in this sketch

    def request_vote_from(self, peer: str) -> bool:
        """Send a RequestVote RPC to one peer (transport elided)"""
        ...

    def handle_append_entries(self, term: int, leader_id: str):
        """Handle AppendEntries RPC (also serves as heartbeat)"""
        # Reset election timeout
        self.last_heartbeat = time.monotonic()

        # If term is old, reject
        if term < self.current_term:
            return (self.current_term, False)

        # If term is newer, update and step down
        if term > self.current_term:
            self.current_term = term
            self.voted_for = None
        self.state = State.FOLLOWER

        # This is the current leader
        self.leader_id = leader_id
        return (self.current_term, True)
```
Log Replication
Once elected, the leader:
- Receives commands from clients
- Appends to its log
- Replicates to followers
- Commits once majority have replicated
```python
class RaftNode:
    # ... (previous code)

    def handle_client_request(self, command: Any) -> bool:
        """Handle client request (only leader processes these)"""
        if self.state != State.LEADER:
            return False
        # Append to local log
        entry = LogEntry(term=self.current_term,
                         index=len(self.log), command=command)
        self.log.append(entry)
        # Replicate to followers
        self.replicate_log()
        return True

    def replicate_log(self) -> None:
        """Replicate log entries to all followers"""
        for peer in self.peers:
            next_idx = self.next_index[peer]
            # Entries to send
            entries = self.log[next_idx:]
            # Previous log entry for consistency check
            prev_index = next_idx - 1
            prev_term = self.log[prev_index].term if prev_index >= 0 else 0
            if self.send_append_entries(peer, prev_index, prev_term, entries):
                # Update indices
                if entries:
                    self.next_index[peer] = entries[-1].index + 1
                    self.match_index[peer] = entries[-1].index
            else:
                # Follower's log inconsistent, decrement nextIndex and retry
                self.next_index[peer] = max(0, next_idx - 1)
        # Check if we can commit any entries
        self.update_commit_index()

    def update_commit_index(self) -> None:
        """Update commit index if majority have replicated"""
        for n in range(self.commit_index + 1, len(self.log)):
            # Count how many servers have replicated entry n
            count = 1  # Leader has it
            for peer in self.peers:
                if self.match_index.get(peer, -1) >= n:
                    count += 1
            majority = (len(self.peers) + 1) // 2 + 1
            # If majority have replicated and entry is from current term
            if count >= majority and self.log[n].term == self.current_term:
                self.commit_index = n

    def send_append_entries(self, peer: str, prev_index: int, prev_term: int,
                            entries: List[LogEntry]) -> bool:
        """Send AppendEntries RPC to one follower (transport elided)"""
        ...
```
Safety Properties
Raft guarantees several safety properties:
- Election Safety: At most one leader per term
- Leader Append-Only: A leader never overwrites or deletes entries in its log
- Log Matching: If two logs contain an entry with the same index and term, they are identical up to that index
- Leader Completeness: Once a log entry is committed, it will be present in the logs of all future leaders
- State Machine Safety: If a server has applied a log entry at a given index, no other server will apply a different entry at that index
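To see what Log Matching buys us, here is a small illustrative checker (not from the Raft paper) that verifies the invariant across two logs, reusing the LogEntry type from the sketch above:

```python
from typing import List

def logs_match(log_a: List[LogEntry], log_b: List[LogEntry]) -> bool:
    """Log Matching: if two logs share an entry with the same index and
    term, then all entries up to and including that index are identical."""
    for i in range(min(len(log_a), len(log_b))):
        if log_a[i].term != log_b[i].term:
            continue
        # Same index and term: the prefixes must agree entirely
        for j in range(i + 1):
            if (log_a[j].term, log_a[j].command) != (log_b[j].term, log_b[j].command):
                return False
    return True
```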
Raft vs Paxos: The Great Debate
Similarities
Both algorithms:
- Achieve consensus in asynchronous environments
- Tolerate `(n-1)/2` failures in a cluster of `n` nodes
- Are theoretically equivalent in what they can achieve
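As a quick sanity check on that bound, majority quorums in a cluster of n nodes need n // 2 + 1 members, so:

```python
def fault_tolerance(n: int) -> int:
    """Majority quorums tolerate (n - 1) // 2 crashed nodes."""
    majority = n // 2 + 1
    return n - majority  # Nodes that can fail while a quorum survives

assert fault_tolerance(3) == 1  # 3 nodes: majority 2, tolerates 1 failure
assert fault_tolerance(5) == 2  # 5 nodes: majority 3, tolerates 2 failures
assert fault_tolerance(4) == 1  # Even sizes cost more but tolerate no extra
```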
Differences
| Aspect | Paxos | Raft |
|---|---|---|
| Understandability | Notoriously difficult | Designed for understandability |
| Leader | Weak or no leader | Strong leader |
| Log structure | Can have gaps | Sequential, no gaps |
| Configuration changes | Complex | Explicit membership change |
| Implementation | Many variations | Single, well-defined algorithm |
The Understandability Revolution
Raft's understandability has measurable impact:
- Faster onboarding for new engineers
- Fewer implementation bugs
- Better mental models for debugging
- More implementations (etcd, Consul, CockroachDB, etc.)
Key insight: An algorithm that humans can understand and implement correctly is more valuable than a theoretically pure but opaque one.
Beyond Raft: Modern Consensus
Flexible Paxos
Allows different quorum sizes for prepare and accept phases:
"""
Flexible Paxos: Phase 1 and Phase 2 can use different quorum sizes
As long as Q1 + Q2 > N, safety is preserved
"""
assert + > ,
=
= # Phase 1 quorum
= # Phase 2 quorum
# Example: In 5-node cluster
# Traditional: Q1=3, Q2=3
# Flexible: Q1=2, Q2=4 (faster reads, slower writes)
# Flexible: Q1=4, Q2=2 (faster writes, slower reads)
Multi-Raft and Sharding
Large-scale systems use multiple Raft groups:
"""Multiple Raft groups for horizontal scaling"""
: =
# Each shard has its own Raft group
=
"""Route write to appropriate shard"""
= %
return
"""Route read to appropriate shard"""
= %
# Read from leader or with read quorum
return
Leaderless Consensus: EPaxos
Egalitarian Paxos removes the leader bottleneck:
- Any node can propose
- Commands are partially ordered (not totally ordered)
- Better load distribution
- More complex conflict resolution
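The heart of that partial ordering is conflict detection: two commands need a relative order only if they interfere, for example by touching the same key. A minimal illustrative sketch (the names are hypothetical, not the EPaxos paper's pseudocode):

```python
from dataclasses import dataclass
from typing import List, Set

@dataclass
class Command:
    op: str   # "read" or "write"
    key: str

def conflicts(a: Command, b: Command) -> bool:
    """Commands interfere if they touch the same key and one is a write."""
    return a.key == b.key and "write" in (a.op, b.op)

def dependencies(new_cmd: Command, history: List[Command]) -> Set[int]:
    """Indices of earlier commands the new one must be ordered after.
    Non-conflicting commands can commit in parallel on any replica."""
    return {i for i, old in enumerate(history) if conflicts(new_cmd, old)}
```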
Practical Considerations
Performance Optimizations
"""Raft with practical optimizations"""
# Pipeline: send multiple AppendEntries in parallel
= 10
# Batching: group multiple commands
= 100
=
# Read optimization: lease-based reads
= 0
"""Fast read without consensus (leader lease)"""
# Can read locally without consensus
return
# Fall back to linearizable read
return
"""Read that requires consensus (slower but safe)"""
# Append a no-op entry and wait for commit
# This ensures we see all committed writes
pass
Configuration Changes
Safely adding/removing nodes:
"""Raft with safe configuration changes"""
"""Safely add a new node to the cluster"""
# Joint consensus approach (Raft paper Section 6)
# 1. Append C_old,new to log
=
=
# 2. Wait for C_old,new to commit
# During this time, decisions require majority in BOTH configs
# 3. Append C_new to log
# 4. Wait for C_new to commit
Real-World Implementations
etcd (Raft)
Used by Kubernetes for cluster coordination:
```bash
# etcd uses Raft for consistency
etcdctl put mykey "myvalue"
# Member list shows Raft cluster
etcdctl member list
```
Google Spanner (Paxos)
Global-scale database using Paxos groups:
- Multiple Paxos groups (one per shard)
- TrueTime for global consistency
- Cross-datacenter replication
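TrueTime's key trick is commit wait: a transaction picks a commit timestamp and then waits out the clock uncertainty before making its writes visible, so timestamp order matches real-time order. A simplified sketch of the idea (the function names and the 7ms uncertainty are illustrative; Spanner's actual API returns an interval via TT.now()):

```python
import time
from dataclasses import dataclass

@dataclass
class TTInterval:
    earliest: float  # True time is guaranteed to be at least this
    latest: float    # ... and at most this

def tt_now(epsilon: float = 0.007) -> TTInterval:
    """TrueTime-style reading: true time lies within +/- epsilon."""
    now = time.time()
    return TTInterval(now - epsilon, now + epsilon)

def commit_wait() -> float:
    """Pick a commit timestamp, then wait until it is safely in the past."""
    s = tt_now().latest              # Chosen commit timestamp
    while tt_now().earliest <= s:    # Wait out the uncertainty window
        time.sleep(0.001)
    return s                         # Now safe to expose the commit
```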
CockroachDB (Raft)
Distributed SQL database:
- Multi-Raft for horizontal scaling
- Ranges (data shards) each have a Raft group
- Automatic rebalancing
Conclusion: The Philosophy of Consensus
The evolution from Paxos to Raft teaches us profound lessons:
1. Correctness Isn't Enough
Paxos is provably correct, but that doesn't make it usable. Understandability is a first-class concern.
2. Decomposition Matters
Raft's decomposition into leader election, log replication, and safety made a complex problem tractable.
3. Strong Leadership Simplifies
A strong leader (Raft) is easier to reason about than weak/no leader (Paxos), even if theoretically equivalent.
4. Trade-offs Are Fundamental
All consensus algorithms trade off:
- Latency vs. Fault Tolerance
- Throughput vs. Consistency
- Simplicity vs. Flexibility
5. Implementation Matters
The gap between algorithm and implementation is where theory meets practice. Raft's prescriptive approach reduces this gap.
The ultimate insight: Consensus algorithms aren't just about getting machines to agree—they're about getting humans to agree on how machines should agree.
Paxos and Raft represent two philosophies:
- Paxos: Mathematical elegance and flexibility
- Raft: Pragmatic understandability and implementation
Both have their place. The choice depends on your priorities: Do you optimize for theoretical properties or human comprehension?
In the end, the best algorithm is the one your team can implement correctly and maintain confidently.
What consensus algorithm does your system use? Have you implemented Raft or Paxos? Share your experiences in the comments.