Consensus in Distributed Systems: From Paxos to Raft
The Fundamental Problem
In a world where single points of failure are unacceptable, distributed systems reign supreme. Yet with distribution comes a profound challenge: how do independent nodes agree on a single truth? This is the consensus problem, and it lies at the heart of every reliable distributed system—from databases to blockchains, from cloud orchestration to distributed locks.
The consensus problem asks: given a collection of processes that can fail and communicate over an unreliable network, how can they agree on a single value? This seemingly simple question has profound implications for consistency, availability, and partition tolerance.
The CAP Theorem and Consensus
Before diving into consensus algorithms, we must understand the constraints. The CAP theorem tells us that in the presence of network partitions, we must choose between consistency and availability. Consensus algorithms are fundamentally about maintaining consistency—ensuring all nodes agree on the same sequence of operations.
```python
# The consensus problem in pseudo-code
class ConsensusNode:
    def __init__(self, node_id, peers):
        self.node_id = node_id
        self.peers = peers
        self.chosen_value = None

    def propose(self, value):
        """
        Propose a value for consensus.
        Challenge: How do we ensure all nodes accept the same value,
        even in the face of failures and network partitions?
        """
        pass

    def learn(self, value):
        """
        Learn the chosen value.
        All correct nodes must learn the same value.
        """
        self.chosen_value = value
```
Paxos: The Theoretical Foundation
Leslie Lamport's Paxos algorithm, developed in the late 1980s (though not published until 1998), is the theoretical foundation of distributed consensus. Paxos is famously difficult to understand: the original paper, the whimsical "The Part-Time Parliament," presented the algorithm as a parable about a Greek island's legislature, and Lamport later wrote "Paxos Made Simple" in an attempt to make it accessible.
The Paxos Protocol
Paxos operates with three roles:
- Proposers: Propose values for consensus
- Acceptors: Vote on proposed values
- Learners: Learn the chosen value
The protocol proceeds in two phases:
Phase 1: Prepare

```go
// Paxos Phase 1: Prepare
// A proposer picks a unique, increasing proposal number n and asks each
// acceptor to promise not to accept any proposal numbered below n.
type PrepareResponse struct {
	Promised      bool
	AcceptedN     int         // highest proposal number already accepted (0 if none)
	AcceptedValue interface{} // value accepted under AcceptedN (nil if none)
}

type Proposer struct{ n int } // n: this proposer's current proposal number

type Acceptor struct {
	promisedN, acceptedN int
	acceptedValue        interface{}
}

func (a *Acceptor) HandlePrepare(n int) PrepareResponse {
	if n <= a.promisedN {
		return PrepareResponse{} // already promised a higher-numbered proposal
	}
	a.promisedN = n
	return PrepareResponse{Promised: true, AcceptedN: a.acceptedN, AcceptedValue: a.acceptedValue}
}
```

Phase 2: Accept

```go
// Paxos Phase 2: Accept
// If a majority promised, the proposer asks acceptors to accept (n, value),
// where value is the highest-numbered value reported in Phase 1, or the
// proposer's own value if no acceptor had accepted anything.
func (a *Acceptor) HandleAccept(n int, value interface{}) bool {
	if n < a.promisedN {
		return false // a higher-numbered Prepare arrived in the meantime
	}
	a.promisedN, a.acceptedN, a.acceptedValue = n, n, value
	return true
}
```
The Complexity of Paxos
Paxos provides strong safety guarantees:
- Validity: Only proposed values can be chosen
- Agreement: At most one value is chosen
- Termination: A value is eventually chosen, assuming sufficient synchrony (the FLP impossibility result shows that no deterministic algorithm can guarantee termination in a fully asynchronous system)
However, Paxos is notoriously difficult to implement correctly. The protocol handles multiple proposers competing simultaneously, message reordering, and partial failures. This complexity led to many incorrect implementations and the search for alternatives.
Multi-Paxos: Toward Practical Systems
Basic Paxos achieves consensus on a single value. Real systems need to agree on a sequence of values—a replicated log. Multi-Paxos extends basic Paxos by electing a stable leader who can skip Phase 1 for subsequent proposals.
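To make the idea concrete, here is a minimal in-process sketch, reusing the `Acceptor` and `HandleAccept` from the Paxos section above; the `Leader` type and `AcceptEntry` method are illustrative names, not from any particular implementation:

```go
// Multi-Paxos sketch (hypothetical): the leader has already completed Phase 1
// once with proposal number n, so each new log entry needs only Phase 2.
type Leader struct {
	n         int // proposal number established by a single Phase 1
	acceptors []*Acceptor
}

// AcceptEntry skips Phase 1 and sends Accept(n, value) directly.
// A real system runs this per log slot rather than per acceptor state.
func (l *Leader) AcceptEntry(value interface{}) bool {
	acks := 0
	for _, a := range l.acceptors {
		if a.HandleAccept(l.n, value) {
			acks++
		}
	}
	return acks > len(l.acceptors)/2 // chosen once a majority accepts
}
```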
Raft: Understandability as a Design Goal
In 2013, Diego Ongaro and John Ousterhout introduced Raft with an explicit goal: understandability. They recognized that consensus algorithms are implemented by humans, and human understanding is crucial for correctness.
Raft decomposes consensus into three independent subproblems:
- Leader Election: How to select a leader when one fails
- Log Replication: How the leader replicates its log to followers
- Safety: How to ensure consistency across failures
Leader Election
Raft nodes exist in three states: follower, candidate, or leader.
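Followers that hear nothing from a leader within a randomized election timeout become candidates and ask their peers for votes. Below is a minimal sketch of that transition; `RaftNode` and its fields are hypothetical names for illustration, not taken from any specific library:

```go
// Hypothetical sketch of Raft's election state machine, not a full implementation.
type State int

const (
	Follower State = iota
	Candidate
	Leader
)

type RaftNode struct {
	id          int
	state       State
	currentTerm int
	votedFor    int // -1 when this node has not voted in currentTerm
}

// On election timeout, a follower becomes a candidate: it increments its
// term, votes for itself, and sends RequestVote RPCs to all peers.
// It becomes leader on receiving votes from a majority.
func (n *RaftNode) startElection() {
	n.state = Candidate
	n.currentTerm++
	n.votedFor = n.id
}
```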
Log Replication
Once elected, the leader handles all client requests and replicates its log to followers.
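The heart of replication is a consistency check: a follower accepts entries only if its log matches the leader's at the position just before the new entries. A sketch under the same assumptions as above (it adds a hypothetical `log []LogEntry` field to `RaftNode`):

```go
type LogEntry struct {
	Term    int
	Command []byte
}

// appendEntries mirrors Raft's AppendEntries rules: reject stale leaders,
// reject if the log does not match at prevIndex/prevTerm (the leader then
// retries with an earlier prevIndex), otherwise overwrite any conflicting
// suffix with the leader's entries.
func (n *RaftNode) appendEntries(term, prevIndex, prevTerm int, entries []LogEntry) bool {
	if term < n.currentTerm {
		return false
	}
	if prevIndex >= 0 && (prevIndex >= len(n.log) || n.log[prevIndex].Term != prevTerm) {
		return false
	}
	n.log = append(n.log[:prevIndex+1], entries...)
	return true
}
```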
Safety Properties
Raft's safety guarantees ensure that:
- Election Safety: At most one leader per term
- Leader Append-Only: Leaders never overwrite or delete entries
- Log Matching: If two logs contain an entry with the same index and term, all preceding entries are identical
- Leader Completeness: If a log entry is committed in a given term, it will be present in the logs of the leaders of all higher terms (enforced by the voting rule sketched after this list)
- State Machine Safety: If a server has applied a log entry at a given index, no other server will apply a different entry for that index
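Leader Completeness, referenced in the list above, follows from a voting restriction: a node grants its vote only to a candidate whose log is at least as up-to-date as its own. A minimal sketch with hypothetical parameter names:

```go
// A candidate's log is "up-to-date" if its last entry has a higher term,
// or the same term and at least as high an index. Voters refuse candidates
// that fail this check, so a new leader always holds all committed entries.
func logUpToDate(candLastTerm, candLastIndex, myLastTerm, myLastIndex int) bool {
	if candLastTerm != myLastTerm {
		return candLastTerm > myLastTerm
	}
	return candLastIndex >= myLastIndex
}
```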
Comparing Paxos and Raft
| Aspect | Paxos | Raft |
|---|---|---|
| Understandability | Complex, multiple explanations exist | Designed for understandability |
| Leader | Multiple proposers can compete | Single leader, stronger leadership |
| Log structure | Allows gaps in log | Logs are contiguous, no gaps |
| Implementation | Many subtle variations | More prescriptive specification |
| Performance | Minimal message delays in theory | Comparable in practice: one round trip to a majority per commit |
| Adoption | Chubby, Spanner | etcd, Consul, CockroachDB |
Real-World Implementations
etcd: Raft for Kubernetes
etcd, the distributed key-value store backing Kubernetes, uses Raft for consensus.
```go
// Simplified etcd Raft usage
import (
	"context"

	"go.etcd.io/etcd/raft/v3"
)

type EtcdNode struct {
	node raft.Node
}

func (e *EtcdNode) start() {
	cfg := &raft.Config{ID: 1, ElectionTick: 10, HeartbeatTick: 1,
		Storage: raft.NewMemoryStorage(), MaxSizePerMsg: 1024 * 1024, MaxInflightMsgs: 256}
	// Start a single-node cluster; real deployments pass all peer IDs.
	e.node = raft.StartNode(cfg, []raft.Peer{{ID: 1}})
}

// propose submits data to the Raft log; it is applied once a quorum commits it.
func (e *EtcdNode) propose(data []byte) error {
	return e.node.Propose(context.TODO(), data)
}
```
CockroachDB: Multi-Raft
CockroachDB uses a "multi-Raft" approach, running one Raft group per range of keys.
```sql
-- In CockroachDB, consensus happens per range
-- Each range has its own Raft group
SHOW RANGES FROM TABLE users;

-- Output shows multiple ranges, each with Raft replicas:
-- range_id | start_key | end_key   | replicas
-- 1        | /Min      | /Table/50 | {1,2,3}
-- 2        | /Table/50 | /Table/51 | {2,3,4}
```
Beyond Raft: Modern Variants
Flexible Paxos
Recent research shows that Paxos does not require majorities in both phases: it is enough that every Phase 1 quorum intersects every Phase 2 quorum. This "Flexible Paxos" lets systems shrink the Phase 2 quorum used on the common commit path, improving availability and latency, at the cost of a larger Phase 1 quorum during leader changes. For example, with four nodes:
```python
# Flexible Paxos: the two phases' quorums only need to intersect.
def flexible_quorums(n=4):
    # Phase 1 quorum: 2 nodes
    # Phase 2 quorum: 3 nodes
    # Intersection: at least 1 node
    # 2 + 3 > 4, so guaranteed intersection
    phase1_quorum = 2
    phase2_quorum = 3
    return phase1_quorum + phase2_quorum > n
```
EPaxos: Leaderless Consensus
Egalitarian Paxos (EPaxos) eliminates the leader bottleneck by allowing any replica to act as leader for the commands it receives. Commands that do not interfere commit after a single round trip; interfering commands are ordered through explicitly tracked dependencies.
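A rough sketch of the interference test this relies on; `Command` and `interferes` are illustrative names and a deliberate simplification of EPaxos's dependency tracking:

```go
// Illustrative simplification: two commands interfere if they touch the
// same key and at least one is a write. Non-interfering commands can commit
// from any replica in one round trip; interfering ones acquire dependencies
// and are ordered before execution.
type Command struct {
	Key     string
	IsWrite bool
}

func interferes(a, b Command) bool {
	return a.Key == b.Key && (a.IsWrite || b.IsWrite)
}
```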
Viewstamped Replication
VR, developed by Brian Oki and Barbara Liskov in 1988 (before Paxos was published), takes an approach similar to Raft's, with a designated primary and view changes, but remained far less well known for many years.
The Philosophy of Consensus
Beyond the algorithms, consensus raises profound questions:
What is truth in a distributed system? In a centralized system, the database state is the truth. In a distributed system, truth is what the majority agrees upon. This is a philosophical shift from absolute to consensus-based truth.
The cost of agreement: Consensus requires communication, and communication requires time. The speed of light imposes a lower bound on consensus latency. We cannot escape physics.
Byzantine failures: All algorithms discussed assume non-Byzantine failures—nodes may crash but won't lie. Byzantine fault tolerance (BFT) algorithms like PBFT handle malicious nodes but with higher overhead.
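One concrete cost of that stronger failure model is replica count: tolerating f crash failures takes 2f+1 replicas, while tolerating f Byzantine replicas takes 3f+1. A small sketch encoding those well-known bounds:

```go
// Tolerating f crash failures needs 2f+1 replicas, so that any two majority
// quorums intersect. Tolerating f Byzantine replicas needs 3f+1, so that any
// two quorums intersect in at least one honest replica.
func minReplicas(f int, byzantine bool) int {
	if byzantine {
		return 3*f + 1
	}
	return 2*f + 1
}
```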
Practical Considerations
When to Use Consensus
Consensus is necessary when:
- Strong consistency is required (financial transactions, distributed locks)
- Coordination is needed (leader election, configuration management)
- Ordering matters (event sequencing, log replication)
Consensus is overkill when:
- Eventual consistency suffices (DNS, caches, analytics)
- Commutativity allows for conflict-free replication (CRDTs)
- Single writer pattern can be used
Performance Implications
```python
# Consensus latency components
def consensus_round_latency():
    # Network RTT to a majority of nodes
    network_rtt = 50  # ms, depends on geography
    # Disk write latency (fsync for durability)
    disk_write = 10   # ms, depends on storage
    # Processing time
    processing = 1    # ms
    # Minimum latency for one round
    return network_rtt + disk_write + processing

# For a cross-datacenter deployment:
#   3 datacenters, 5 nodes total
#   RTT between DCs: 100ms
#   Total latency: ~110ms per consensus round
#
# For a single datacenter:
#   RTT: 1ms
#   Total latency: ~12ms per consensus round
```
Conclusion: The Price of Agreement
Consensus algorithms represent one of distributed systems' greatest achievements. From Paxos's theoretical elegance to Raft's practical clarity, these algorithms enable the reliable distributed systems we depend on daily.
Yet consensus is not free. It requires:
- Time: Multiple network round trips
- Space: Replicated storage across nodes
- Complexity: Careful implementation and testing
The choice to use consensus is a choice to prioritize correctness over latency, safety over availability. In a world of distributed systems, this choice is often the right one.
As Leslie Lamport noted, "A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable." Consensus algorithms are our defense against such chaos—they transform unreliable components into reliable systems, probabilistic networks into deterministic guarantees.
The journey from Paxos to Raft to modern variants continues. Each iteration brings us closer to the ideal: consensus algorithms that are correct, efficient, and—perhaps most importantly—understandable by the engineers who must implement and maintain them.
The quest for distributed consensus mirrors a deeper human need: in a world of uncertainty, we seek agreement. In distributed systems, as in life, consensus is how we transform individual knowledge into collective truth.