Consensus in Distributed Systems: From Paxos to Raft

An exploration of distributed consensus algorithms, from the theoretical foundation of Paxos to the practical clarity of Raft, examining how modern systems achieve agreement in the face of failures and network partitions.

The Fundamental Problem

In a world where single points of failure are unacceptable, distributed systems reign supreme. Yet with distribution comes a profound challenge: how do independent nodes agree on a single truth? This is the consensus problem, and it lies at the heart of every reliable distributed system—from databases to blockchains, from cloud orchestration to distributed locks.

The consensus problem asks: given a collection of processes that can fail and communicate over an unreliable network, how can they agree on a single value? This seemingly simple question has profound implications for consistency, availability, and partition tolerance.

The CAP Theorem and Consensus

Before diving into consensus algorithms, we must understand the constraints. The CAP theorem tells us that in the presence of network partitions, we must choose between consistency and availability. Consensus algorithms are fundamentally about maintaining consistency—ensuring all nodes agree on the same sequence of operations.

# The consensus problem in pseudo-code
class ConsensusNode:
    def __init__(self, node_id, peers):
        self.node_id = node_id
        self.peers = peers
        self.accepted_value = None
        
    def propose(self, value):
        """
        Propose a value for consensus.
        Challenge: How do we ensure all nodes accept the same value,
        even in the face of failures and network partitions?
        """
        pass
        
    def learn(self, value):
        """
        Learn the chosen value.
        All correct nodes must learn the same value.
        """
        self.accepted_value = value

Paxos: The Theoretical Foundation

Leslie Lamport's Paxos algorithm, devised around 1989 (though not published until 1998), is the theoretical foundation of distributed consensus. The original paper, the whimsical "The Part-Time Parliament," proved so difficult to understand that Lamport later wrote follow-up explanations such as "Paxos Made Simple."

The Paxos Protocol

Paxos operates with three roles:

  1. Proposers: Propose values for consensus
  2. Acceptors: Vote on proposed values
  3. Learners: Learn the chosen value

The protocol proceeds in two phases:

Phase 1: Prepare

// Paxos Phase 1: Prepare
type PrepareResponse struct {
    Promise        bool
    AcceptedNumber int
    AcceptedValue  interface{}
}

type Proposer struct {
    proposalNumber int
    acceptors      []*Acceptor
}

func (p *Proposer) Prepare() PrepareResponse {
    p.proposalNumber++
    
    // Send prepare(n) to every acceptor
    responses := make(chan PrepareResponse, len(p.acceptors))
    
    for _, acceptor := range p.acceptors {
        go func(a *Acceptor) {
            responses <- a.ReceivePrepare(p.proposalNumber)
        }(acceptor)
    }
    
    // Collect responses until a majority has promised
    majority := len(p.acceptors)/2 + 1
    promises := []PrepareResponse{}
    
    for i := 0; i < len(p.acceptors) && len(promises) < majority; i++ {
        if r := <-responses; r.Promise {
            promises = append(promises, r)
        }
    }
    
    // If any acceptor has already accepted a value, the proposer is bound to it
    return p.findHighestAccepted(promises)
}

type Acceptor struct {
    promisedNumber int
    acceptedNumber int
    acceptedValue  interface{}
}

func (a *Acceptor) ReceivePrepare(n int) PrepareResponse {
    if n > a.promisedNumber {
        // Promise never to accept a proposal numbered below n,
        // and report any previously accepted value
        a.promisedNumber = n
        return PrepareResponse{
            Promise:        true,
            AcceptedNumber: a.acceptedNumber,
            AcceptedValue:  a.acceptedValue,
        }
    }
    return PrepareResponse{Promise: false}
}

Phase 2: Accept

// Paxos Phase 2: Accept
func (p *Proposer) Accept(value interface{}) bool {
    // Send accept(n, value) to every acceptor
    responses := make(chan bool, len(p.acceptors))
    
    for _, acceptor := range p.acceptors {
        go func(a *Acceptor) {
            responses <- a.ReceiveAccept(p.proposalNumber, value)
        }(acceptor)
    }
    
    // Count acceptances until a majority is reached or all acceptors reply
    majority := len(p.acceptors)/2 + 1
    accepted := 0
    
    for i := 0; i < len(p.acceptors) && accepted < majority; i++ {
        if <-responses {
            accepted++
        }
    }
    
    return accepted >= majority
}

func (a *Acceptor) ReceiveAccept(n int, value interface{}) bool {
    // Accept unless a higher-numbered prepare has been promised in the meantime
    if n >= a.promisedNumber {
        a.promisedNumber = n
        a.acceptedNumber = n
        a.acceptedValue = value
        return true
    }
    return false
}

The Complexity of Paxos

Paxos provides strong guarantees:

  • Validity: Only proposed values can be chosen
  • Agreement: At most one value is chosen, and every correct node learns that same value
  • Termination: Eventually a value is chosen. This is a liveness rather than a safety property, and it holds only under reasonable timing assumptions, since competing proposers can stall progress indefinitely

However, Paxos is notoriously difficult to implement correctly. The protocol must cope with multiple proposers competing simultaneously, message reordering, and partial failures. This complexity led to many incorrect implementations and to the search for alternatives.
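
To see why liveness is the hard part, picture two proposers that keep preempting each other: each new prepare invalidates the other's pending accept, and neither value is ever chosen. The standalone sketch below is a deliberately simplified, hypothetical simulation (a single acceptor, hand-scripted interleaving, no relation to the protocol code above) that makes the livelock concrete; practical deployments avoid it by funneling proposals through a distinguished proposer.

package main

import "fmt"

// acceptor remembers only the highest proposal number it has promised.
type acceptor struct {
    promised int
}

func (a *acceptor) prepare(n int) bool {
    if n > a.promised {
        a.promised = n
        return true
    }
    return false
}

// accept succeeds only if no higher-numbered prepare has arrived since.
func (a *acceptor) accept(n int) bool {
    return n >= a.promised
}

func main() {
    acc := &acceptor{}
    var nA, nB int

    for round := 1; round <= 3; round++ {
        nA = acc.promised + 1
        acc.prepare(nA) // proposer A's prepare succeeds
        nB = acc.promised + 1
        acc.prepare(nB) // proposer B preempts A with a higher number

        fmt.Printf("round %d: A accept(%d) -> %v\n", round, nA, acc.accept(nA)) // false
        nA = acc.promised + 1
        acc.prepare(nA) // A retries with a higher number, preempting B
        fmt.Printf("round %d: B accept(%d) -> %v\n", round, nB, acc.accept(nB)) // false
    }
    // Neither proposer ever gets a value accepted: a livelock.
}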

[Figure: Paxos message flow]

Multi-Paxos: Toward Practical Systems

Basic Paxos achieves consensus on a single value. Real systems need to agree on a sequence of values—a replicated log. Multi-Paxos extends basic Paxos by electing a stable leader who can skip Phase 1 for subsequent proposals.

// Multi-Paxos with a stable, elected leader
struct MultiPaxos {
    current_leader: NodeId,
    current_term: u64,
    log: Vec<LogEntry>,
    commit_index: usize,
}

impl MultiPaxos {
    fn append_entry(&mut self, value: Value) -> Result<(), Error> {
        if self.is_leader() {
            // A stable leader can skip Phase 1 (prepare) for each new entry
            let index = self.log.len();
            let entry = LogEntry {
                term: self.current_term,
                index,
                value,
            };
            
            // Directly send accept to followers, then append locally
            self.replicate_to_followers(&entry)?;
            self.log.push(entry);
            Ok(())
        } else {
            // Redirect the client to the current leader
            Err(Error::NotLeader(self.current_leader))
        }
    }
    
    fn replicate_to_followers(&self, entry: &LogEntry) -> Result<(), Error> {
        // Send to a majority of followers,
        // wait for acknowledgments,
        // then advance the commit index
        todo!()
    }
}

Raft: Understandability as a Design Goal

In 2013, Diego Ongaro and John Ousterhout introduced Raft with an explicit goal: understandability. They recognized that consensus algorithms are implemented by humans, and human understanding is crucial for correctness.

Raft decomposes consensus into three independent subproblems:

  1. Leader Election: How to select a leader when one fails
  2. Log Replication: How the leader replicates its log to followers
  3. Safety: How to ensure consistency across failures

Leader Election

Raft nodes exist in three states: follower, candidate, or leader.

enum NodeState {
  Follower,
  Candidate,
  Leader
}

class RaftNode {
  private state: NodeState = NodeState.Follower;
  private currentTerm: number = 0;
  private votedFor: string | null = null;
  private log: LogEntry[] = [];
  private commitIndex: number = 0;
  private lastApplied: number = 0;
  
  // Election timeout triggers leader election
  private electionTimeout: number = randomRange(150, 300);
  
  startElection(): void {
    // Transition to candidate
    this.state = NodeState.Candidate;
    this.currentTerm++;
    this.votedFor = this.nodeId;
    
    let votesReceived = 1; // Vote for self
    
    // Request votes from all peers
    for (const peer of this.peers) {
      const response = peer.requestVote({
        term: this.currentTerm,
        candidateId: this.nodeId,
        lastLogIndex: this.log.length - 1,
        lastLogTerm: this.log[this.log.length - 1]?.term || 0
      });
      
      if (response.voteGranted) {
        votesReceived++;
      }
      
      // A candidate wins with votes from a majority of the full
      // cluster (its peers plus itself)
      if (votesReceived > (this.peers.length + 1) / 2) {
        this.becomeLeader();
        return;
      }
    }
    
    // Election failed, restart timeout
    this.resetElectionTimeout();
  }
  
  requestVote(request: VoteRequest): VoteResponse {
    // Reply false if the candidate's term is stale
    if (request.term < this.currentTerm) {
      return { term: this.currentTerm, voteGranted: false };
    }
    
    // A higher term means our own term is stale: adopt it and revert to follower
    if (request.term > this.currentTerm) {
      this.currentTerm = request.term;
      this.state = NodeState.Follower;
      this.votedFor = null;
    }
    
    // If votedFor is null or candidateId, and candidate's log is
    // at least as up-to-date as receiver's log, grant vote
    if ((this.votedFor === null || this.votedFor === request.candidateId) &&
        this.isLogUpToDate(request.lastLogIndex, request.lastLogTerm)) {
      this.votedFor = request.candidateId;
      this.resetElectionTimeout();
      return { term: this.currentTerm, voteGranted: true };
    }
    
    return { term: this.currentTerm, voteGranted: false };
  }
}

Log Replication

Once elected, the leader handles all client requests and replicates its log to followers.

class RaftNode {
  // Leader state
  private nextIndex: Map<string, number> = new Map();
  private matchIndex: Map<string, number> = new Map();
  
  appendEntries(request: AppendEntriesRequest): AppendEntriesResponse {
    // Reply false if the leader's term is stale
    if (request.term < this.currentTerm) {
      return { term: this.currentTerm, success: false };
    }
    
    // A valid AppendEntries comes from the legitimate leader for this (or a
    // newer) term: adopt the term, remain a follower, reset the election timeout
    if (request.term > this.currentTerm) {
      this.currentTerm = request.term;
    }
    this.state = NodeState.Follower;
    this.resetElectionTimeout();
    
    // Reply false if log doesn't contain an entry at prevLogIndex
    // whose term matches prevLogTerm
    if (request.prevLogIndex >= 0) {
      const entry = this.log[request.prevLogIndex];
      if (!entry || entry.term !== request.prevLogTerm) {
        return { term: this.currentTerm, success: false };
      }
    }
    
    // If an existing entry conflicts with a new one,
    // delete the existing entry and all that follow it
    for (let i = 0; i < request.entries.length; i++) {
      const index = request.prevLogIndex + 1 + i;
      const newEntry = request.entries[i];
      
      if (this.log[index] && this.log[index].term !== newEntry.term) {
        this.log = this.log.slice(0, index);
      }
      
      if (index >= this.log.length) {
        this.log.push(newEntry);
      }
    }
    
    // Update commit index
    if (request.leaderCommit > this.commitIndex) {
      this.commitIndex = Math.min(
        request.leaderCommit,
        this.log.length - 1
      );
    }
    
    return { term: this.currentTerm, success: true };
  }
  
  replicateLog(): void {
    if (this.state !== NodeState.Leader) return;
    
    for (const peer of this.peers) {
      const nextIdx = this.nextIndex.get(peer.id) || 0;
      const prevLogIndex = nextIdx - 1;
      const prevLogTerm = prevLogIndex >= 0 
        ? this.log[prevLogIndex].term 
        : 0;
      
      const response = peer.appendEntries({
        term: this.currentTerm,
        leaderId: this.nodeId,
        prevLogIndex,
        prevLogTerm,
        entries: this.log.slice(nextIdx),
        leaderCommit: this.commitIndex
      });
      
      if (response.success) {
        this.nextIndex.set(peer.id, this.log.length);
        this.matchIndex.set(peer.id, this.log.length - 1);
        this.updateCommitIndex();
      } else if (response.term > this.currentTerm) {
        // Discovered higher term, step down
        this.becomeFollower(response.term);
      } else {
        // Follower's log diverges: back nextIndex up (not below 0) and retry
        this.nextIndex.set(peer.id, Math.max(0, nextIdx - 1));
      }
    }
  }
}

[Figure: Raft consensus]

Safety Properties

Raft's safety guarantees ensure that:

  1. Election Safety: At most one leader per term
  2. Leader Append-Only: Leaders never overwrite or delete entries
  3. Log Matching: If two logs contain an entry with the same index and term, all preceding entries are identical
  4. Leader Completeness: If a log entry is committed in a given term, it will be present in the log of every leader of a higher-numbered term (enforced by the election-time log comparison sketched after this list)
  5. State Machine Safety: If a server has applied a log entry at a given index, no other server will apply a different entry for that index
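
That election-time check grants a vote only to candidates whose log is at least as up-to-date as the voter's (the isLogUpToDate helper left abstract in the TypeScript sketch above): the candidate's last entry must carry a higher term, or the same term with at least the same index. Below is a minimal Go illustration of the comparison, with hypothetical field names rather than any particular implementation's types.

package main

import "fmt"

type logMeta struct {
    lastIndex int // index of the last entry in the log
    lastTerm  int // term of the last entry in the log
}

// upToDate reports whether the candidate's log is at least as up-to-date as
// the voter's: a later last term wins; equal terms fall back to log length.
func upToDate(candidate, voter logMeta) bool {
    if candidate.lastTerm != voter.lastTerm {
        return candidate.lastTerm > voter.lastTerm
    }
    return candidate.lastIndex >= voter.lastIndex
}

func main() {
    voter := logMeta{lastIndex: 7, lastTerm: 3}
    fmt.Println(upToDate(logMeta{lastIndex: 5, lastTerm: 4}, voter)) // true: newer term
    fmt.Println(upToDate(logMeta{lastIndex: 9, lastTerm: 2}, voter)) // false: stale term
    fmt.Println(upToDate(logMeta{lastIndex: 7, lastTerm: 3}, voter)) // true: same term, same length
}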

Comparing Paxos and Raft

  Aspect            | Paxos                                  | Raft
  Understandability | Complex; several explanations exist    | Designed explicitly for understandability
  Leader            | Multiple proposers can compete         | Single strong leader per term
  Log structure     | Allows gaps in the log                 | Logs are contiguous, with no gaps
  Implementation    | Many subtle variations                 | More prescriptive specification
  Performance       | Theoretically minimal message delays   | Comparable: one round trip to a majority from a stable leader
  Adoption          | Chubby, Spanner                        | etcd, Consul, CockroachDB

Real-World Implementations

etcd: Raft for Kubernetes

etcd, the distributed key-value store backing Kubernetes, uses Raft for consensus.

// Simplified etcd Raft usage
import (
    "context"

    "go.etcd.io/etcd/raft/v3"
)

type EtcdNode struct {
    raftNode raft.Node
    storage  *raft.MemoryStorage
    proposeC chan string
    commitC  chan *commit // commit is an application-defined type (not shown)
}

func (n *EtcdNode) processCommits() {
    for commit := range n.commitC {
        if commit.data != nil {
            // Apply the committed entry to the application's state machine
            n.applyToStateMachine(commit.data)
        }
    }
}

func (n *EtcdNode) propose(data []byte) error {
    // Hand the proposal to Raft; it reaches the state machine only once committed
    return n.raftNode.Propose(context.TODO(), data)
}

CockroachDB: Multi-Raft

CockroachDB uses a "multi-Raft" approach, running one Raft group per range of keys.

-- In CockroachDB, consensus happens per range
-- Each range has its own Raft group
SHOW RANGES FROM TABLE users;

-- Output shows multiple ranges, each with Raft replicas:
-- range_id | start_key | end_key | replicas
-- 1        | /Min      | /Table/50 | {1,2,3}
-- 2        | /Table/50 | /Table/51 | {2,3,4}

Beyond Raft: Modern Variants

Flexible Paxos

Research by Heidi Howard and colleagues shows that Paxos does not require majority quorums in both phases; it only requires that every Phase 1 quorum intersects every Phase 2 quorum. This "Flexible Paxos" result can improve availability and steady-state performance.

class FlexiblePaxos:
    def __init__(self, n_acceptors, phase1_quorum, phase2_quorum):
        # Quorums need not be majorities; they only need to intersect.
        # |Q1| + |Q2| > N guarantees that every Phase 1 quorum overlaps
        # every Phase 2 quorum.
        assert phase1_quorum + phase2_quorum > n_acceptors
        self.n_acceptors = n_acceptors
        self.phase1_quorum = phase1_quorum
        self.phase2_quorum = phase2_quorum
        
    def can_commit(self, phase1_responses, phase2_responses):
        return (len(phase1_responses) >= self.phase1_quorum and
                len(phase2_responses) >= self.phase2_quorum)

# Example: 4 acceptors, Phase 1 quorum of 2, Phase 2 quorum of 3.
# 2 + 3 > 4, so the two quorums are guaranteed to intersect.
fp = FlexiblePaxos(n_acceptors=4, phase1_quorum=2, phase2_quorum=3)

EPaxos: Leaderless Consensus

Egalitarian Paxos (EPaxos) eliminates the leader bottleneck by letting any replica commit the commands it receives directly, imposing an explicit order only on commands that interfere with one another.
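
Here is a deliberately simplified, hypothetical Go sketch of that interference check; EPaxos itself tracks per-command dependencies and runs a more involved commit protocol.

package main

import "fmt"

type command struct {
    key   string
    write bool
}

// interferes reports whether two commands must be ordered relative to each
// other: they touch the same key and at least one of them is a write.
func interferes(a, b command) bool {
    return a.key == b.key && (a.write || b.write)
}

func main() {
    put := command{key: "x", write: true}
    get := command{key: "x", write: false}
    other := command{key: "y", write: true}

    fmt.Println(interferes(put, get))   // true: replicas must agree on an order
    fmt.Println(interferes(put, other)) // false: commit independently, in parallel
}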

Viewstamped Replication

Viewstamped Replication (VR), developed by Oki and Liskov in the late 1980s (a decade before Paxos was published), takes a leader-based approach much like Raft's, yet it remained far less widely known.

The Philosophy of Consensus

Beyond the algorithms, consensus raises profound questions:

What is truth in a distributed system? In a centralized system, the database state is the truth. In a distributed system, truth is what the majority agrees upon. This is a philosophical shift from absolute to consensus-based truth.

The cost of agreement: Consensus requires communication, and communication requires time. The speed of light imposes a lower bound on consensus latency. We cannot escape physics.

Byzantine failures: All algorithms discussed assume non-Byzantine failures—nodes may crash but won't lie. Byzantine fault tolerance (BFT) algorithms like PBFT handle malicious nodes but with higher overhead.
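
That overhead starts with cluster sizing: tolerating f crash faults with majority quorums requires 2f + 1 replicas, while PBFT-style Byzantine fault tolerance requires 3f + 1. A small illustrative Go snippet of the arithmetic:

package main

import "fmt"

// Replica counts needed to tolerate f simultaneous faults:
// 2f+1 for crash faults (Paxos/Raft), 3f+1 for Byzantine faults (PBFT-style).
func crashFaultReplicas(f int) int     { return 2*f + 1 }
func byzantineFaultReplicas(f int) int { return 3*f + 1 }

func main() {
    for f := 1; f <= 3; f++ {
        fmt.Printf("tolerate %d fault(s): crash=%d replicas, byzantine=%d replicas\n",
            f, crashFaultReplicas(f), byzantineFaultReplicas(f))
    }
}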

// The fundamental tension in distributed systems
const CAP_THEOREM = {
  choose_two_of_three: ['Consistency', 'Availability', 'Partition Tolerance'],
  
  reality: 'Partitions happen. You must choose between C and A.',
  
  consensus_choice: 'Consensus algorithms choose Consistency over Availability',
  
  philosophical_insight: 
    'Agreement is more valuable than availability when correctness matters.'
};

[Figure: Consensus philosophy]

Practical Considerations

When to Use Consensus

Consensus is necessary when:

  • Strong consistency is required (financial transactions, distributed locks)
  • Coordination is needed (leader election, configuration management)
  • Ordering matters (event sequencing, log replication)

Consensus is overkill when:

  • Eventual consistency suffices (DNS, caches, analytics)
  • Commutativity allows for conflict-free replication (CRDTs, as in the sketch after this list)
  • Single writer pattern can be used
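
When updates commute, replicas can converge without any agreement protocol at all. Below is a minimal, illustrative Go sketch of a grow-only counter (G-Counter) CRDT: each node increments its own slot and replicas merge by taking element-wise maximums, so no consensus round is needed. It is a toy, not a production CRDT library.

package main

import "fmt"

// gcounter maps a node ID to the number of increments that node performed.
type gcounter map[string]int

func (g gcounter) increment(node string) { g[node]++ }

func (g gcounter) value() int {
    total := 0
    for _, n := range g {
        total += n
    }
    return total
}

// merge combines two replicas by taking the per-node maximum; the operation
// is commutative, associative, and idempotent, so replicas always converge.
func merge(a, b gcounter) gcounter {
    out := gcounter{}
    for node, n := range a {
        out[node] = n
    }
    for node, n := range b {
        if n > out[node] {
            out[node] = n
        }
    }
    return out
}

func main() {
    replicaA := gcounter{}
    replicaB := gcounter{}
    replicaA.increment("a")
    replicaA.increment("a")
    replicaB.increment("b")

    merged := merge(replicaA, replicaB)
    fmt.Println(merged.value()) // 3, regardless of merge order
}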

Performance Implications

# Consensus latency components

class ConsensusLatency:
    def estimate_latency(self):
        # Network RTT to majority of nodes
        network_rtt = 50  # ms, depends on geography
        
        # Disk write latency (fsync for durability)
        disk_write = 10  # ms, depends on storage
        
        # Processing time
        processing = 1  # ms
        
        # Minimum latency for one round
        return network_rtt + disk_write + processing
        
# For cross-datacenter deployment:
# 3 datacenters, 5 nodes total
# RTT between DCs: 100ms
# Total latency: ~110ms per consensus round

# For single datacenter:
# RTT: 1ms
# Total latency: ~12ms per consensus round

Conclusion: The Price of Agreement

Consensus algorithms represent one of distributed systems' greatest achievements. From Paxos's theoretical elegance to Raft's practical clarity, these algorithms enable the reliable distributed systems we depend on daily.

Yet consensus is not free. It requires:

  • Time: Multiple network round trips
  • Space: Replicated storage across nodes
  • Complexity: Careful implementation and testing

The choice to use consensus is a choice to prioritize correctness over latency, safety over availability. In a world of distributed systems, this choice is often the right one.

As Leslie Lamport noted, "A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable." Consensus algorithms are our defense against such chaos—they transform unreliable components into reliable systems, probabilistic networks into deterministic guarantees.

The journey from Paxos to Raft to modern variants continues. Each iteration brings us closer to the ideal: consensus algorithms that are correct, efficient, and—perhaps most importantly—understandable by the engineers who must implement and maintain them.


The quest for distributed consensus mirrors a deeper human need: in a world of uncertainty, we seek agreement. In distributed systems, as in life, consensus is how we transform individual knowledge into collective truth.