Distributed consensus evolved from Paxos's mathematical rigor to Raft's design-for-understandability philosophy. Paxos, while theoretically elegant with its two-phase prepare/promise and accept/accepted protocol, has a notorious reputation for being difficult to understand and implement correctly. Raft explicitly prioritizes understandability by decomposing consensus into leader election, log replication, and safety properties, with a strong-leader model that simplifies reasoning. Both algorithms tolerate up to ⌊(n-1)/2⌋ node failures in an n-node cluster and achieve equivalent theoretical guarantees, but Raft's prescriptive approach has led to more correct implementations and faster developer onboarding. Modern variants like Flexible Paxos and Multi-Raft address specific performance needs, while leaderless approaches like EPaxos remove the leader bottleneck. The key insight: an understandable algorithm that teams can implement correctly is more valuable than a theoretically pure but opaque one. The best consensus algorithm isn't determined by mathematical properties alone, but by the intersection of correctness, understandability, and practical implementation concerns.

The Art of Consensus: From Paxos to Raft and Beyond

Distributed nodes reaching consensus

In distributed systems, achieving agreement among multiple nodes is deceptively difficult. When networks partition, nodes fail, and messages get delayed or lost, how can we ensure all nodes agree on a single value? This is the consensus problem—one of the most fundamental challenges in distributed computing.

The journey from Paxos to Raft represents not just technical evolution, but a shift in how we think about algorithm design: Is it more important to be mathematically rigorous or to be understandable?

The Consensus Problem: Why Is It So Hard?

The FLP Impossibility Theorem

In 1985, Fischer, Lynch, and Paterson proved a shocking result: in an asynchronous distributed system, no deterministic algorithm can guarantee consensus (that is, guarantee it will terminate) if even a single node can crash.

This doesn't mean consensus is impossible in practice—it means we must make trade-offs:

  • Synchrony assumptions: Bounded message delays
  • Failure models: Only certain types of failures
  • Liveness vs. Safety: Can't guarantee both always

FLP Impossibility visualization

The CAP Theorem

Another fundamental constraint: you can't have all three of:

  • Consistency: All nodes see the same data
  • Availability: Every request receives a response
  • Partition tolerance: System continues despite network splits

Consensus algorithms choose CP (Consistency + Partition tolerance) over availability during partitions.
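
To make the CP choice concrete, here is a minimal sketch (the replica API and exception type are illustrative assumptions, not any particular system's interface): a replica group that refuses to acknowledge a write unless a majority of replicas is reachable, giving up availability during a partition to preserve consistency.

class QuorumWriter:
    """Illustrative CP behavior: fail writes rather than break consistency."""
    
    def __init__(self, replicas: list):
        self.replicas = replicas
        self.majority = len(replicas) // 2 + 1
    
    def write(self, key: str, value) -> bool:
        acks = 0
        for replica in self.replicas:
            try:
                replica.store(key, value)   # assumed replica API; may fail during a partition
                acks += 1
            except ConnectionError:
                continue
        
        if acks < self.majority:
            # Consistency over availability: reject the request rather than
            # acknowledge a write that a majority has not durably seen.
            raise RuntimeError("no quorum reachable: rejecting write")
        return True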

Paxos: The Elegant Monster

History and Impact

Leslie Lamport wrote the Paxos paper ("The Part-Time Parliament") in 1989, but it was not published until 1998. Despite its mathematical elegance, Paxos has a reputation for being notoriously difficult to understand and implement correctly.

Fun fact: Lamport originally presented the algorithm as a story about a legislative consensus system on the (real) Greek island of Paxos, which confused reviewers so much that publication stalled for years!

The Core Algorithm

Paxos operates in two phases:

Phase 1: Prepare/Promise

  • Proposer sends PREPARE(n) with proposal number n
  • Acceptors promise not to accept proposals < n
  • Acceptors return any already-accepted values

Phase 2: Accept/Accepted

  • Proposer sends ACCEPT(n, v) with value v
  • Acceptors accept if they haven't promised to ignore n
  • Once a majority accepts, the value is chosen

from dataclasses import dataclass
from typing import Any, Dict, Optional

@dataclass
class ProposalNumber:
    """Proposal number: (round, server_id) for uniqueness and total ordering"""
    round: int
    server_id: int
    
    def __lt__(self, other):
        return (self.round, self.server_id) < (other.round, other.server_id)
    
    def __eq__(self, other):
        return (self.round, self.server_id) == (other.round, other.server_id)

@dataclass
class Proposal:
    number: ProposalNumber
    value: Any

class PaxosAcceptor:
    """Paxos Acceptor node"""
    
    def __init__(self, node_id: int):
        self.node_id = node_id
        # Highest proposal number seen in Phase 1
        self.promised_num: Optional[ProposalNumber] = None
        # Highest proposal actually accepted
        self.accepted_proposal: Optional[Proposal] = None
    
    def prepare(self, proposal_num: ProposalNumber) -> dict:
        """Phase 1: Respond to PREPARE message"""
        
        # If we've already promised a higher number, reject
        if self.promised_num and proposal_num < self.promised_num:
            return {
                'type': 'PROMISE_REJECT',
                'promised': self.promised_num
            }
        
        # Promise not to accept lower-numbered proposals
        self.promised_num = proposal_num
        
        return {
            'type': 'PROMISE',
            'promised': proposal_num,
            'accepted': self.accepted_proposal  # May be None
        }
    
    def accept(self, proposal: Proposal) -> dict:
        """Phase 2: Respond to ACCEPT message"""
        
        # If we've promised not to accept this proposal number, reject
        if self.promised_num and proposal.number < self.promised_num:
            return {
                'type': 'ACCEPT_REJECT',
                'promised': self.promised_num
            }
        
        # Accept the proposal
        self.accepted_proposal = proposal
        self.promised_num = proposal.number
        
        return {
            'type': 'ACCEPTED',
            'proposal': proposal
        }

class PaxosProposer:
    """Paxos Proposer node"""
    
    def __init__(self, server_id: int, acceptors: list):
        self.server_id = server_id
        self.acceptors = acceptors
        self.current_round = 0
        self.majority = len(acceptors) // 2 + 1
    
    def propose(self, value: Any) -> Optional[Any]:
        """
        Propose a value using Paxos algorithm.
        Returns the chosen value (may differ from proposed value).
        """
        
        # Phase 1: Prepare
        self.current_round += 1
        proposal_num = ProposalNumber(self.current_round, self.server_id)
        
        print(f"[Proposer {self.server_id}] Phase 1: PREPARE({proposal_num.round})")
        
        promises = []
        for acceptor in self.acceptors:
            response = acceptor.prepare(proposal_num)
            if response['type'] == 'PROMISE':
                promises.append(response)
        
        # Need majority of promises
        if len(promises) < self.majority:
            print(f"[Proposer {self.server_id}] Failed to get majority in Phase 1")
            return None
        
        print(f"[Proposer {self.server_id}] Got {len(promises)} promises")
        
        # Check if any acceptor already accepted a value
        accepted_proposals = [p['accepted'] for p in promises if p['accepted']]
        
        if accepted_proposals:
            # Use the value from the highest-numbered accepted proposal
            highest = max(accepted_proposals, key=lambda p: p.number)
            value = highest.value
            print(f"[Proposer {self.server_id}] Using previously accepted value: {value}")
        
        # Phase 2: Accept
        proposal = Proposal(proposal_num, value)
        print(f"[Proposer {self.server_id}] Phase 2: ACCEPT({proposal.number.round}, {value})")
        
        accepts = []
        for acceptor in self.acceptors:
            response = acceptor.accept(proposal)
            if response['type'] == 'ACCEPTED':
                accepts.append(response)
        
        # Need majority of accepts
        if len(accepts) < self.majority:
            print(f"[Proposer {self.server_id}] Failed to get majority in Phase 2")
            return None
        
        print(f"[Proposer {self.server_id}] Value chosen: {value}")
        return value

# Example usage
acceptors = [PaxosAcceptor(i) for i in range(5)]
proposer1 = PaxosProposer(1, acceptors)
proposer2 = PaxosProposer(2, acceptors)

# Proposer 1 proposes "A"
chosen = proposer1.propose("A")
print(f"\nChosen value: {chosen}")

# Proposer 2 tries to propose "B" but will get "A" (already chosen)
chosen = proposer2.propose("B")
print(f"\nChosen value: {chosen}")

Paxos algorithm flow

Multi-Paxos: Achieving Consensus on a Log

Basic Paxos achieves consensus on a single value. Real systems need consensus on a sequence of values (a log). Multi-Paxos optimizes this:

class MultiPaxos:
    """Simplified Multi-Paxos for log consensus"""
    
    def __init__(self, server_id: int, acceptors: list):
        self.server_id = server_id
        self.acceptors = acceptors
        self.log: list = []  # Decided log entries
        self.leader_id: Optional[int] = None
        self.is_leader = False
    
    def become_leader(self):
        """Leader election (simplified)"""
        # In real Multi-Paxos, this uses Phase 1 of Paxos
        self.is_leader = True
        self.leader_id = self.server_id
        print(f"[Server {self.server_id}] Became leader")
    
    def append(self, value: Any) -> bool:
        """Append a value to the log (only leader can do this)"""
        if not self.is_leader:
            print(f"[Server {self.server_id}] Not leader, cannot append")
            return False
        
        # Skip Phase 1 (already done during leader election)
        # Go directly to Phase 2
        index = len(self.log)
        proposal = Proposal(
            ProposalNumber(0, self.server_id),  # Leader uses same proposal number
            value
        )
        
        # Send ACCEPT to all acceptors
        accepts = 0
        for acceptor in self.acceptors:
            response = acceptor.accept(proposal)
            if response['type'] == 'ACCEPTED':
                accepts += 1
        
        majority = len(self.acceptors) // 2 + 1
        if accepts >= majority:
            self.log.append(value)
            print(f"[Server {self.server_id}] Appended to log[{index}]: {value}")
            return True
        
        return False

Why Paxos Is Hard

The problems with Paxos:

  1. Conceptual complexity: Two-phase protocol with subtle edge cases
  2. Implementation difficulty: Gap between algorithm and real systems
  3. No clear leader: Multiple proposers can conflict
  4. Configuration changes: Adding/removing nodes is tricky

As Mike Burrows (creator of Google's Chubby lock service) reportedly put it, there is only one consensus protocol, Paxos, and everything else is just a broken variant of it. Even so, Paxos's reputation for being too hard to understand has stuck.

Raft: Understandability as a Goal

Design Philosophy

In 2014, Diego Ongaro and John Ousterhout published Raft ("In Search of an Understandable Consensus Algorithm") with an explicit goal: understandability. They decomposed consensus into:

  1. Leader election: One server becomes leader
  2. Log replication: Leader replicates entries to followers
  3. Safety: Ensure consistency guarantees

Raft decomposition

Leader Election

Raft servers can be in three states:

from enum import Enum
import time
import random

class ServerState(Enum):
    FOLLOWER = "follower"
    CANDIDATE = "candidate"
    LEADER = "leader"

@dataclass
class LogEntry:
    term: int
    index: int
    command: Any

class RaftNode:
    """Raft consensus node"""
    
    def __init__(self, node_id: int, peers: list):
        self.node_id = node_id
        self.peers = peers  # Other nodes
        
        # Persistent state
        self.current_term = 0
        self.voted_for: Optional[int] = None
        self.log: list[LogEntry] = []
        
        # Volatile state
        self.state = ServerState.FOLLOWER
        self.commit_index = -1   # highest committed entry (-1 = none; log is 0-indexed)
        self.last_applied = -1   # highest entry applied to the state machine
        
        # Leader state
        self.next_index: Dict[int, int] = {}
        self.match_index: Dict[int, int] = {}
        
        # Timing
        self.last_heartbeat = time.time()
        self.election_timeout = self._random_timeout()
    
    def _random_timeout(self) -> float:
        """Random election timeout between 150-300ms"""
        return random.uniform(0.15, 0.30)
    
    def tick(self):
        """Called periodically to check timeouts"""
        
        if self.state == ServerState.LEADER:
            # Send heartbeats
            self._send_heartbeats()
        
        elif self.state == ServerState.FOLLOWER:
            # Check election timeout
            if time.time() - self.last_heartbeat > self.election_timeout:
                self._start_election()
        
        elif self.state == ServerState.CANDIDATE:
            # Check election timeout
            if time.time() - self.last_heartbeat > self.election_timeout:
                self._start_election()
    
    def _start_election(self):
        """Start leader election"""
        self.state = ServerState.CANDIDATE
        self.current_term += 1
        self.voted_for = self.node_id
        self.last_heartbeat = time.time()
        self.election_timeout = self._random_timeout()
        
        print(f"[Node {self.node_id}] Starting election for term {self.current_term}")
        
        # Request votes from all peers
        votes = 1  # Vote for self
        
        for peer in self.peers:
            response = peer.request_vote(
                term=self.current_term,
                candidate_id=self.node_id,
                last_log_index=len(self.log) - 1 if self.log else -1,
                last_log_term=self.log[-1].term if self.log else 0
            )
            
            if response['vote_granted']:
                votes += 1
        
        # Check if won election
        majority = (len(self.peers) + 1) // 2 + 1
        if votes >= majority:
            self._become_leader()
    
    def _become_leader(self):
        """Transition to leader state"""
        print(f"[Node {self.node_id}] Became leader for term {self.current_term}")
        self.state = ServerState.LEADER
        
        # Initialize leader state
        for peer_id in [p.node_id for p in self.peers]:
            self.next_index[peer_id] = len(self.log)
            self.match_index[peer_id] = -1
        
        # Send initial heartbeats
        self._send_heartbeats()
    
    def request_vote(self, term: int, candidate_id: int, 
                    last_log_index: int, last_log_term: int) -> dict:
        """Handle RequestVote RPC"""
        
        # If term is old, reject
        if term < self.current_term:
            return {'term': self.current_term, 'vote_granted': False}
        
        # If term is newer, update and step down
        if term > self.current_term:
            self.current_term = term
            self.state = ServerState.FOLLOWER
            self.voted_for = None
        
        # Check if we can vote for this candidate
        can_vote = (
            self.voted_for is None or self.voted_for == candidate_id
        ) and self._is_log_up_to_date(last_log_index, last_log_term)
        
        if can_vote:
            self.voted_for = candidate_id
            self.last_heartbeat = time.time()
            print(f"[Node {self.node_id}] Voted for {candidate_id} in term {term}")
        
        return {'term': self.current_term, 'vote_granted': can_vote}
    
    def _is_log_up_to_date(self, last_log_index: int, last_log_term: int) -> bool:
        """Check if candidate's log is at least as up-to-date as ours"""
        if not self.log:
            return True
        
        our_last_term = self.log[-1].term
        our_last_index = len(self.log) - 1
        
        if last_log_term != our_last_term:
            return last_log_term > our_last_term
        
        return last_log_index >= our_last_index
    
    def _send_heartbeats(self):
        """Leader sends periodic heartbeats (empty AppendEntries)"""
        for peer in self.peers:
            peer.append_entries(
                term=self.current_term,
                leader_id=self.node_id,
                prev_log_index=-1,
                prev_log_term=0,
                entries=[],
                leader_commit=self.commit_index
            )
    
    def append_entries(self, term: int, leader_id: int,
                      prev_log_index: int, prev_log_term: int,
                      entries: list, leader_commit: int) -> dict:
        """Handle AppendEntries RPC (also serves as heartbeat)"""
        
        # If term is old, reject without resetting our election timeout,
        # so a stale leader cannot suppress new elections
        if term < self.current_term:
            return {'term': self.current_term, 'success': False}
        
        # Message is from a current (or newer) leader: reset election timeout
        self.last_heartbeat = time.time()
        
        # If term is newer, update and step down
        if term > self.current_term:
            self.current_term = term
            self.state = ServerState.FOLLOWER
            self.voted_for = None
        
        # This is the current leader
        if self.state == ServerState.CANDIDATE:
            self.state = ServerState.FOLLOWER
        
        return {'term': self.current_term, 'success': True}

Raft leader election

Log Replication

Once elected, the leader:

  1. Receives commands from clients
  2. Appends to its log
  3. Replicates to followers
  4. Commits once a majority have replicated

class RaftNode:
    # ... (previous code)
    
    def client_request(self, command: Any) -> bool:
        """Handle client request (only leader processes these)"""
        if self.state != ServerState.LEADER:
            return False
        
        # Append to local log
        entry = LogEntry(
            term=self.current_term,
            index=len(self.log),
            command=command
        )
        self.log.append(entry)
        
        print(f"[Leader {self.node_id}] Appended entry {entry.index}: {command}")
        
        # Replicate to followers
        self._replicate_log()
        
        return True
    
    def _replicate_log(self):
        """Replicate log entries to all followers"""
        for peer in self.peers:
            next_idx = self.next_index[peer.node_id]
            
            # Entries to send
            entries = self.log[next_idx:]
            
            # Previous log entry for consistency check
            prev_log_index = next_idx - 1
            prev_log_term = self.log[prev_log_index].term if prev_log_index >= 0 else 0
            
            response = peer.append_entries(
                term=self.current_term,
                leader_id=self.node_id,
                prev_log_index=prev_log_index,
                prev_log_term=prev_log_term,
                entries=entries,
                leader_commit=self.commit_index
            )
            
            if response['success']:
                # Update indices
                if entries:
                    self.next_index[peer.node_id] = entries[-1].index + 1
                    self.match_index[peer.node_id] = entries[-1].index
            else:
                # Follower's log inconsistent, decrement nextIndex and retry
                self.next_index[peer.node_id] = max(0, next_idx - 1)
        
        # Check if we can commit any entries
        self._update_commit_index()
    
    def _update_commit_index(self):
        """Update commit index if majority have replicated"""
        for n in range(self.commit_index + 1, len(self.log)):
            # Count how many servers have replicated entry n
            replicated = 1  # Leader has it
            
            for peer_id, match_idx in self.match_index.items():
                if match_idx >= n:
                    replicated += 1
            
            majority = (len(self.peers) + 1) // 2 + 1
            
            # If majority have replicated and entry is from current term
            if replicated >= majority and self.log[n].term == self.current_term:
                self.commit_index = n
                print(f"[Leader {self.node_id}] Committed up to index {n}")

Safety Properties

Raft guarantees several safety properties:

Election Safety: At most one leader per term

Leader Append-Only: Leader never overwrites or deletes entries

Log Matching: If two logs contain an entry with the same index and term, then the logs are identical in all entries up through that index (a small checker for this property is sketched below)

Leader Completeness: If a log entry is committed, it will be present in all future leaders

State Machine Safety: If a server has applied a log entry at a given index, no other server will apply a different entry at that index

Raft safety properties
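
To make the Log Matching property concrete, here is a small, illustrative checker (not part of Raft itself) that uses the LogEntry class defined earlier: if two logs hold the same term at some index, their entire prefixes up to that index must be identical.

def check_log_matching(log_a: list, log_b: list) -> bool:
    """Illustrative check of Raft's Log Matching property."""
    for i in range(min(len(log_a), len(log_b))):
        if log_a[i].term == log_b[i].term:
            # Same index and same term => all preceding entries must match too
            prefix_a = [(e.term, e.command) for e in log_a[:i + 1]]
            prefix_b = [(e.term, e.command) for e in log_b[:i + 1]]
            if prefix_a != prefix_b:
                return False
    return True

# Example: logs that agree on (index, term) pairs must agree on commands
log_a = [LogEntry(term=1, index=0, command='SET x=1'), LogEntry(term=2, index=1, command='SET y=2')]
log_b = [LogEntry(term=1, index=0, command='SET x=1'), LogEntry(term=2, index=1, command='SET y=2')]
print(check_log_matching(log_a, log_b))  # True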

Raft vs Paxos: The Great Debate

Similarities

Both algorithms:

  • Achieve consensus in asynchronous environments
  • Tolerate up to ⌊(n-1)/2⌋ node failures in a cluster of n nodes
  • Are theoretically equivalent in what they can achieve

Differences

Aspect                  Paxos                   Raft
----------------------  ----------------------  ------------------------------
Understandability       Notoriously difficult   Designed for understandability
Leader                  Weak or no leader       Strong leader
Log structure           Can have gaps           Sequential, no gaps
Configuration changes   Complex                 Explicit membership change
Implementation          Many variations         Single, well-defined algorithm

The Understandability Revolution

Raft's understandability has measurable impact:

  • Faster onboarding for new engineers
  • Fewer implementation bugs
  • Better mental models for debugging
  • More implementations (etcd, Consul, CockroachDB, etc.)

Key insight: An algorithm that humans can understand and implement correctly is more valuable than a theoretically pure but opaque one.

Paxos vs Raft comparison

Beyond Raft: Modern Consensus

Flexible Paxos

Allows different quorum sizes for prepare and accept phases:

class FlexiblePaxos:
    """
    Flexible Paxos: Phase 1 and Phase 2 can use different quorum sizes
    As long as Q1 + Q2 > N, safety is preserved
    """
    
    def __init__(self, n_nodes: int, q1_size: int, q2_size: int):
        assert q1_size + q2_size > n_nodes, "Quorums must overlap"
        
        self.n_nodes = n_nodes
        self.q1_size = q1_size  # Phase 1 quorum
        self.q2_size = q2_size  # Phase 2 quorum
    
    # Example: In 5-node cluster
    # Traditional: Q1=3, Q2=3
    # Flexible: Q1=2, Q2=4 (faster reads, slower writes)
    # Flexible: Q1=4, Q2=2 (faster writes, slower reads)
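
For illustration, the class above could be instantiated for a 5-node cluster as follows (a sketch only, using the constructor defined above):

# Classic majority quorums: 3 + 3 > 5
classic = FlexiblePaxos(n_nodes=5, q1_size=3, q2_size=3)

# Shrink the Phase 2 quorum so each replicated command needs fewer acks,
# at the cost of a larger Phase 1 quorum during leader changes: 4 + 2 > 5
write_optimized = FlexiblePaxos(n_nodes=5, q1_size=4, q2_size=2)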

Multi-Raft and Sharding

Large-scale systems use multiple Raft groups:

class MultiRaft:
    """Multiple Raft groups for horizontal scaling"""
    
    def __init__(self, num_shards: int):
        self.raft_groups: Dict[int, RaftNode] = {}
        
        for shard_id in range(num_shards):
            # Each shard has its own Raft group
            self.raft_groups[shard_id] = RaftNode(shard_id, peers=[])
    
    def write(self, key: str, value: Any):
        """Route write to appropriate shard"""
        # Note: Python's built-in hash() for strings is randomized per process;
        # a real system would use a stable hash function for routing.
        shard_id = hash(key) % len(self.raft_groups)
        return self.raft_groups[shard_id].client_request(('SET', key, value))
    
    def read(self, key: str):
        """Route read to appropriate shard"""
        shard_id = hash(key) % len(self.raft_groups)
        # Read from leader or with read quorum
        return self._read_from_shard(shard_id, key)

Leaderless Consensus: EPaxos

Egalitarian Paxos removes the leader bottleneck:

  • Any node can propose
  • Commands are partially ordered (not totally ordered)
  • Better load distribution
  • More complex conflict resolution (a simplified conflict check is sketched below)
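
A greatly simplified sketch of the key idea (this is only the interference test, not the real EPaxos protocol with its dependency tracking and sequence numbers): two commands need to be ordered relative to each other only if they conflict, for example by touching the same key with at least one write, so non-conflicting commands can commit independently on any replica.

@dataclass
class Command:
    op: str    # e.g. 'SET' or 'GET'
    key: str

def interferes(a: Command, b: Command) -> bool:
    """Commands conflict only if they touch the same key and at least one writes."""
    return a.key == b.key and 'SET' in (a.op, b.op)

print(interferes(Command('SET', 'x'), Command('SET', 'y')))  # False: commit in any order
print(interferes(Command('SET', 'x'), Command('GET', 'x')))  # True: must be ordered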

Practical Considerations

Performance Optimizations

class OptimizedRaft(RaftNode):
    """Raft with practical optimizations"""
    
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        
        # Pipeline: send multiple AppendEntries in parallel
        self.pipeline_depth = 10
        
        # Batching: group multiple commands
        self.batch_size = 100
        self.pending_commands = []
        
        # Read optimization: lease-based reads
        self.lease_expiry = 0
    
    def read_with_lease(self, key: str):
        """Fast read without consensus (leader lease)"""
        if self.state == ServerState.LEADER and time.time() < self.lease_expiry:
            # Can read locally without consensus
            return self._local_read(key)
        
        # Fall back to linearizable read
        return self._linearizable_read(key)
    
    def _linearizable_read(self, key: str):
        """Read that requires consensus (slower but safe)"""
        # Append a no-op entry and wait for commit
        # This ensures we see all committed writes
        pass

Configuration Changes

Safely adding/removing nodes:

class RaftWithConfigChange(RaftNode):
    """Raft with safe configuration changes"""
    
    def add_node(self, new_node_id: int):
        """Safely add a new node to the cluster"""
        
        # Joint consensus approach (Raft paper, Section 6)
        # 1. Append the joint configuration C_old,new to the log
        old_config = self.get_current_config()
        new_config = old_config.union({new_node_id})
        
        self.client_request(('CONFIG_CHANGE_JOINT', (old_config, new_config)))
        
        # 2. Wait for C_old,new to commit
        # During this time, decisions require a majority in BOTH configurations
        
        # 3. Append the new configuration C_new to the log
        self.client_request(('CONFIG_CHANGE', new_config))
        
        # 4. Wait for C_new to commit

Raft configuration change

Real-World Implementations

etcd (Raft)

Used by Kubernetes for cluster coordination:

# etcd uses Raft for consistency
$ etcdctl put mykey "hello world"
OK

$ etcdctl get mykey
mykey
hello world

# Member list shows Raft cluster
$ etcdctl member list
8e9e05c52164694d: name=node1 peerURLs=http://localhost:2380 clientURLs=http://localhost:2379 isLeader=true

Google Spanner (Paxos)

Global-scale database using Paxos groups:

  • Multiple Paxos groups (one per shard)
  • TrueTime for global consistency (a toy sketch of the commit-wait idea follows below)
  • Cross-datacenter replication
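
A toy sketch of the commit-wait idea behind TrueTime (the interval API and the 5 ms uncertainty bound here are illustrative stand-ins, not Spanner's actual implementation): the coordinator picks a commit timestamp no earlier than "now" and then waits until that timestamp is guaranteed to be in the past before making the commit visible, so timestamps alone can order transactions across datacenters.

import time

UNCERTAINTY = 0.005  # pretend the clock error is bounded by +/- 5 ms

def now_interval():
    """Stand-in for TrueTime's TT.now(): returns (earliest, latest)."""
    t = time.time()
    return (t - UNCERTAINTY, t + UNCERTAINTY)

def commit_wait(commit_timestamp: float):
    """Block until commit_timestamp is definitely in the past."""
    while now_interval()[0] <= commit_timestamp:
        time.sleep(0.001)

earliest, latest = now_interval()
commit_ts = latest        # choose a timestamp no earlier than real time
commit_wait(commit_ts)    # wait out the uncertainty before acknowledging
print("commit visible at", commit_ts)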

CockroachDB (Raft)

Distributed SQL database:

  • Multi-Raft for horizontal scaling
  • Ranges (data shards) each have a Raft group
  • Automatic rebalancing

Conclusion: The Philosophy of Consensus

The evolution from Paxos to Raft teaches us profound lessons:

1. Correctness Isn't Enough

Paxos is provably correct, but that doesn't make it usable. Understandability is a first-class concern.

2. Decomposition Matters

Raft's decomposition into leader election, log replication, and safety made a complex problem tractable.

3. Strong Leadership Simplifies

A strong leader (Raft) is easier to reason about than weak/no leader (Paxos), even if theoretically equivalent.

4. Trade-offs Are Fundamental

All consensus algorithms trade off:

  • Latency vs. Fault Tolerance
  • Throughput vs. Consistency
  • Simplicity vs. Flexibility

5. Implementation Matters

The gap between algorithm and implementation is where theory meets practice. Raft's prescriptive approach reduces this gap.

Future of consensus

The ultimate insight: Consensus algorithms aren't just about getting machines to agree—they're about getting humans to agree on how machines should agree.

Paxos and Raft represent two philosophies:

  • Paxos: Mathematical elegance and flexibility
  • Raft: Pragmatic understandability and implementation

Both have their place. The choice depends on your priorities: Do you optimize for theoretical properties or human comprehension?

In the end, the best algorithm is the one your team can implement correctly and maintain confidently.


What consensus algorithm does your system use? Have you implemented Raft or Paxos? Share your experiences in the comments.