About the author: I'm Charles Sieg, a cloud architect and platform engineer who builds apps, services, and infrastructure for Fortune 1000 clients through Vantalect. If your organization is rethinking its software strategy in the age of AI-assisted engineering, let's talk.
Every distributed system I have built forced a conversation about consistency before it forced a conversation about performance. Sometimes that conversation happened during design. More often it happened at 3 AM when a customer reported stale data, a checkout double-charged, or a dashboard showed a record that had been deleted twenty seconds earlier. The CAP theorem gets referenced in every system design interview and every architecture document, yet most teams still get caught off guard by what it actually means in production. The theorem itself is simple. Living with its consequences is where the engineering happens.
This is a deep reference on consistency in distributed systems. It covers CAP and its successor PACELC, the full hierarchy of consistency models from linearizability down to eventual consistency, the mechanics of quorums and consensus protocols, conflict resolution strategies including CRDTs, and how all of this maps to the AWS services I use daily. Skip this if you want a textbook summary. What follows is the operational reality.
The CAP Theorem
Origin and Formal Proof
Eric Brewer proposed the CAP conjecture at the ACM Symposium on Principles of Distributed Computing in 2000. Two years later, Seth Gilbert and Nancy Lynch at MIT proved it formally. The theorem states that a distributed data store can provide at most two of three guarantees simultaneously: Consistency, Availability, and Partition Tolerance.
The proof is constructive. Gilbert and Lynch showed that during a network partition, a system must choose: either reject some requests (sacrificing availability) or serve potentially stale data (sacrificing consistency). There is no third option. No clever engineering, no amount of hardware, no exotic protocol eliminates this fundamental constraint.
The Three Properties Defined
These definitions are precise, and the precision matters because casual usage leads to architectural mistakes.
| Property | Formal Definition | Operational Meaning |
|---|---|---|
| Consistency | Every read receives the most recent write or an error | All nodes see the same data at the same time. A read immediately after a write returns the written value, regardless of which node handles the read. |
| Availability | Every request receives a non-error response, without guaranteeing it contains the most recent write | Every functioning node returns a valid response for every request. No timeouts, no errors, but the response might be stale. |
| Partition Tolerance | The system continues to operate despite an arbitrary number of messages being dropped or delayed by the network | The system handles network failures between nodes. Messages between nodes can be lost, delayed, or reordered, and the system keeps running. |
What CAP Actually Says (and What It Does Not)
CAP gets misquoted constantly. The common version goes: "Pick two out of three." That framing is misleading in three ways.
First, partition tolerance is mandatory. Networks fail. Cables get cut. Switches die. Cloud availability zones lose connectivity. Any production distributed system must tolerate partitions. The real choice is between consistency and availability during a partition.
Second, CAP says nothing about what happens when the network is healthy. During normal operations (which is 99.9%+ of the time for most systems), you can have both consistency and availability. The trade-off only materializes during a partition event.
Third, CAP uses a very specific definition of "consistency" (linearizability) and a very specific definition of "availability" (every non-failing node responds). These are extremes. Most real systems operate somewhere in between, which is why the theorem's binary framing frustrates practitioners.
```mermaid
flowchart TD
    A["Network Partition<br/>Detected"] --> B{"System must<br/>choose"}
    B -->|Preserve Consistency| C["Reject requests to<br/>partitioned nodes"]
    C --> D["CP System<br/>Some requests fail"]
    B -->|Preserve Availability| E["Serve requests from<br/>all nodes"]
    E --> F["AP System<br/>Some reads return<br/>stale data"]
    D --> G["Examples:<br/>ZooKeeper, etcd,<br/>HBase, Spanner"]
    F --> H["Examples:<br/>Cassandra, DynamoDB,<br/>CouchDB, Riak"]
```

Beyond CAP: The PACELC Extension
Latency as the Real Trade-Off
Daniel Abadi proposed the PACELC theorem in 2010, and it addresses CAP's biggest blind spot: what happens when the network is fine. PACELC states: if there is a Partition, trade off Availability and Consistency; Else, trade off Latency and Consistency.
That "Else" clause captures the trade-off engineers actually face every day. Even without a partition, replicating data synchronously to guarantee consistency adds latency. Accepting eventual consistency reduces latency. This is the decision that determines your database's day-to-day behavior, and CAP says nothing about it.
PACELC Classification
| Database | Partition Behavior | Normal Behavior | Classification |
|---|---|---|---|
| DynamoDB | Availability | Latency | PA/EL |
| Cassandra | Availability | Latency | PA/EL |
| Spanner | Consistency | Consistency | PC/EC |
| CockroachDB | Consistency | Consistency | PC/EC |
| MongoDB | Availability | Consistency | PA/EC |
| PostgreSQL (single) | Consistency | Consistency | PC/EC |
| Aurora (single-region) | Consistency | Consistency | PC/EC |
| Aurora Global Database | Availability | Latency | PA/EL |
| Redis Cluster | Availability | Latency | PA/EL |
The PA/EL systems (DynamoDB, Cassandra, Aurora Global) optimize for speed and uptime at the cost of consistency guarantees. The PC/EC systems (Spanner, CockroachDB, single-region Aurora) optimize for correctness at the cost of latency. MongoDB lands in an interesting middle ground: it favors availability during partitions but provides strong consistency during normal operation through its single-primary architecture.
Why PACELC Matters More Than CAP in Practice
I stopped using CAP as my primary framework for database selection years ago. PACELC captures the trade-off that actually drives architectural decisions. Network partitions are rare events. Latency is a constant. When I evaluate a database for a new service, I ask: "What happens to read latency when I need consistent reads?" That question has determined more architectural decisions than "What happens during a network partition?" ever has.
Consistency Models: A Hierarchy
Consistency is not binary. Between "perfectly consistent" and "eventually consistent" sits a spectrum of guarantees, each with different performance characteristics and programming complexity.
The Full Spectrum
| Model | Guarantee | Cost | Use Case |
|---|---|---|---|
| Linearizability | Operations appear instantaneous; reads always return the latest write | Highest latency, lowest throughput | Financial transactions, leader election, distributed locks |
| Sequential Consistency | Operations appear in some total order consistent with each process's local order | High latency | Strongly-ordered event logs, replicated state machines |
| Causal Consistency | Causally related operations appear in order; concurrent operations may appear in any order | Moderate latency | Collaborative editing, social feeds, messaging |
| Read-Your-Writes | A process always sees its own writes | Low overhead | User sessions, shopping carts |
| Monotonic Reads | Once a process reads a value, it never sees an older value | Low overhead | Dashboards, reporting |
| Eventual Consistency | All replicas converge to the same value given sufficient time without new writes | Lowest latency, highest throughput | DNS, CDN caches, analytics, metrics |
Linearizability
Linearizability is the gold standard. Every operation appears to execute atomically at some point between its invocation and its response. If client A writes a value and client B reads after A's write returns, B sees A's write. Full stop.
The cost is real. Achieving linearizability in a distributed system requires coordination between nodes on every operation. Spanner uses GPS-synchronized atomic clocks (TrueTime) to keep its uncertainty window under 7 milliseconds, then forces every write to wait out that window before acknowledging. CockroachDB, running on commodity hardware without atomic clocks, faces NTP uncertainty windows of 100-250 milliseconds. Rather than forcing writes to wait that long, CockroachDB provides serializable isolation (which, unlike linearizability, does not guarantee real-time ordering across transactions) and uses read restarts to handle clock skew.
Causal Consistency
Causal consistency sits in a sweet spot that more architects should consider. It guarantees that if operation A causally affects operation B (A happened before B, and B depends on A's result), then every node sees A before B. Operations with no causal relationship can appear in any order.
This model maps naturally to most application semantics. If a user posts a message and then edits it, every reader should see the post before the edit. If two unrelated users post simultaneously, it does not matter which post appears first. Causal consistency provides this guarantee without the coordination overhead of linearizability.
Systems that implement causal consistency track dependencies using vector clocks or similar mechanisms. Each operation carries metadata describing which prior operations it depends on. Nodes delay delivering an operation until all its dependencies have been applied locally.
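The delay-until-dependencies-arrive mechanism can be sketched in a few lines. This is an illustrative toy (class and method names are my own, not from any particular system): each operation carries the sender's vector clock, and a replica buffers an operation until it is the next one from its origin node and everything the origin had seen has been applied locally.

```python
# Sketch: causal delivery using vector clocks. A replica buffers an
# incoming operation until every operation it causally depends on has
# been applied locally. Names and structure are illustrative.

class CausalReplica:
    def __init__(self, node_id, num_nodes):
        self.node_id = node_id
        self.clock = [0] * num_nodes   # ops applied per origin node
        self.buffer = []               # ops waiting on dependencies

    def _deliverable(self, op_clock, sender):
        # Deliverable iff it is the next op from the sender and we have
        # already applied everything else the sender had seen.
        return all(
            op_clock[i] == self.clock[i] + 1 if i == sender
            else op_clock[i] <= self.clock[i]
            for i in range(len(self.clock))
        )

    def receive(self, op_clock, sender, payload, apply_fn):
        self.buffer.append((op_clock, sender, payload))
        progress = True
        while progress:  # drain everything that has become deliverable
            progress = False
            for item in list(self.buffer):
                clock, s, p = item
                if self._deliverable(clock, s):
                    self.buffer.remove(item)
                    self.clock[s] += 1
                    apply_fn(p)
                    progress = True
```

If an edit arrives before the post it modifies, the replica holds the edit in its buffer; once the post arrives, both are applied in causal order.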
Eventual Consistency
Eventual consistency guarantees only that all replicas will converge to the same value given sufficient time with no new writes. It says nothing about how long convergence takes. It says nothing about what intermediate states a reader might observe. It provides no ordering guarantees whatsoever.
The appeal is performance. Without coordination requirements, every node can accept writes independently and replicate asynchronously. DynamoDB's eventually consistent reads are half the cost and measurably faster than strongly consistent reads. S3 operated with eventual consistency for a decade before Amazon engineered strong read-after-write consistency in December 2020 (with no performance penalty, which tells you something about how much headroom they had).
The danger is that "eventual" has no upper bound. In practice, DynamoDB's replication typically converges within milliseconds. In theory, there is no guarantee. During heavy load, replication lag can spike. During a partition, replicas can diverge indefinitely. Application code must handle all of these cases if it reads eventually consistent data.
Quorum Mechanics
The R + W > N Formula
Quorum-based replication is the most common mechanism for implementing tunable consistency. The formula is simple:
- N = total number of replicas
- W = number of replicas that must acknowledge a write
- R = number of replicas that must respond to a read
If R + W > N, at least one node in every read quorum participated in the most recent write quorum. That overlap guarantees the read returns the latest value.
| Configuration | N | W | R | R + W | Consistency | Trade-off |
|---|---|---|---|---|---|---|
| Strong consistency | 3 | 2 | 2 | 4 > 3 | Strong | Balanced latency, tolerates 1 failure |
| Write-heavy | 3 | 1 | 3 | 4 > 3 | Strong | Fast writes, slow reads |
| Read-heavy | 3 | 3 | 1 | 4 > 3 | Strong | Fast reads, slow writes, no write fault tolerance |
| Eventual | 3 | 1 | 1 | 2 < 3 | Eventual | Fast everything, stale reads possible |
| High durability | 5 | 3 | 3 | 6 > 5 | Strong | Tolerates 2 failures, higher latency |
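The overlap guarantee is easy to verify by brute force. This short check (a sketch, not any database's actual code) enumerates every possible write quorum and read quorum and confirms they intersect exactly when R + W > N:

```python
# Sketch: why R + W > N guarantees read freshness. Enumerate every
# possible write quorum and read quorum of an N-node replica set and
# check that each pair shares at least one node.
from itertools import combinations

def quorums_always_overlap(n, w, r):
    nodes = range(n)
    return all(
        set(wq) & set(rq)          # non-empty intersection
        for wq in combinations(nodes, w)
        for rq in combinations(nodes, r)
    )
```

Running this against the table above confirms the math: (N=3, W=2, R=2) and (N=3, W=1, R=3) always overlap, while (N=3, W=1, R=1) admits a read quorum that missed the latest write.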
Tunable Consistency in Cassandra
Cassandra exposes the quorum math directly to the application through per-query consistency levels. In a cluster with replication factor 3:
| Consistency Level | Nodes Required | Behavior |
|---|---|---|
| ONE | 1 | Fastest. Returns the first response. Stale reads possible. |
| TWO | 2 | Moderate consistency. Reduces stale read probability. |
| QUORUM | 2 (majority) | Strong consistency when paired with QUORUM writes. Cross-datacenter latency. |
| LOCAL_QUORUM | 2 (local DC majority) | Strong consistency within a datacenter. Avoids cross-DC latency. Production default for most workloads. |
| EACH_QUORUM | Majority per DC | Strong consistency across all datacenters. Highest latency. |
| ALL | 3 | Strongest guarantee. Zero fault tolerance. Avoid in production. |
The production-grade configuration for most Cassandra deployments: write at LOCAL_QUORUM, read at LOCAL_QUORUM. This gives you strong consistency within each datacenter with cross-datacenter eventual consistency. Use QUORUM only when cross-datacenter consistency is required and the latency penalty is acceptable.
What Quorums Cannot Guarantee
Quorums guarantee that a read overlaps with the latest write. They do not guarantee linearizability. Two clients can issue concurrent writes, both receive acknowledgments, and subsequent reads can return either value. Quorums provide "latest write" visibility, not total ordering. Achieving total ordering requires a consensus protocol layered on top of quorum mechanics.
Consensus Protocols
Why Consensus Matters
Quorums handle read/write overlap. Consensus handles agreement: ensuring all nodes agree on the order of operations, who the leader is, or what the current state should be. Without consensus, you get split-brain scenarios where two nodes both believe they are the primary, accept conflicting writes, and corrupt your data.
Paxos
Leslie Lamport published Paxos in 1998. It solves the problem of getting a group of unreliable nodes to agree on a single value. The protocol works in three phases: prepare, accept, and learn. A proposer sends a prepare request with a proposal number. Acceptors promise to reject proposals with lower numbers. Once a majority of acceptors have promised, the proposer sends an accept request. Once a majority accept, the value is chosen.
Paxos is provably correct. It is also notoriously difficult to implement. Google's internal Paxos implementation (used in Chubby, their distributed lock service) went through years of bug fixes. The gap between the theoretical protocol and a production implementation is enormous.
Raft
Diego Ongaro and John Ousterhout designed Raft in 2014 specifically because Paxos was too hard to understand and implement correctly. Raft decomposes consensus into three sub-problems: leader election, log replication, and safety.
The key simplification: Raft constrains which nodes can become leader. Only nodes with up-to-date logs can win an election. Paxos allows any node to be leader, then requires the new leader to reconstruct the latest state. Raft's constraint means the leader always has the latest data, eliminating a major source of implementation bugs.
Raft powers etcd, CockroachDB, TiKV, and most modern consensus implementations. If I am evaluating a distributed system and it uses Paxos directly rather than Raft, I ask why. The answer is usually "legacy."
Leader Election and Split-Brain
Every consensus protocol designates a leader. The leader serializes operations and replicates them to followers. The dangerous moment is leader transition: the old leader has not yet realized it lost leadership, and the new leader has just been elected.
Raft prevents split-brain through term numbers. Each leader election increments the term. Followers reject requests from leaders with stale terms. If a partitioned leader tries to acknowledge writes, its followers will have moved to a higher term and ignore it. The writes the old leader accepted are uncommitted (they never reached a quorum) and get rolled back when the partition heals.
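The term-fencing rule reduces to a single comparison. This toy follower (illustrative only, nowhere near a full Raft implementation) shows how a deposed leader's appends get rejected:

```python
# Sketch: how term numbers fence off a stale leader. A follower rejects
# any append from a leader whose term is lower than the follower's
# current term. Illustrative, not a complete Raft implementation.

class Follower:
    def __init__(self):
        self.current_term = 0
        self.log = []

    def append_entries(self, leader_term, entries):
        if leader_term < self.current_term:
            return False                 # stale leader: reject
        self.current_term = leader_term  # adopt the newer term
        self.log.extend(entries)
        return True
```

Once the follower has seen a term-2 leader, a partitioned term-1 leader can no longer get its writes acknowledged, so those writes never reach a quorum.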
In practice, I have seen split-brain cause data corruption exactly once: a ZooKeeper cluster where network latency between nodes was high enough that session timeouts kept expiring, triggering continuous leader elections. Each election caused a brief window where two nodes both believed they were leader. The fix was adjusting tick intervals and session timeouts to match the actual network latency between nodes. The lesson: consensus protocols are correct in theory, but their configuration parameters must match your network's real-world behavior.
Clock Synchronization: The Hidden Dependency
Distributed consensus should not depend on synchronized clocks. Raft and Paxos use logical ordering (term numbers, proposal numbers) rather than wall-clock time. In theory.
In practice, timeouts drive leader election. If node A's clock runs fast, it triggers elections more frequently than necessary, destabilizing the cluster. If node B's clock runs slow, it may not notice the leader has failed until long after other nodes have elected a replacement.
Spanner solved this with TrueTime: GPS receivers and atomic clocks in every datacenter, providing clock uncertainty bounds under 7 milliseconds globally. CockroachDB uses NTP with 100-250 millisecond uncertainty and compensates with read restarts when clock skew is detected. Both approaches work. Both have operational implications that surface at the worst possible times.
```mermaid
flowchart TD
    A["Need distributed<br/>consensus?"] --> B{"Ordering<br/>requirement"}
    B -->|Total order required| C{"Latency<br/>budget"}
    C -->|Sub-10ms| D["Spanner / TrueTime<br/>Requires atomic clocks"]
    C -->|10-100ms acceptable| E["Raft-based system<br/>etcd, CockroachDB"]
    C -->|Seconds acceptable| F["Multi-Paxos<br/>Chubby, older systems"]
    B -->|Causal order sufficient| G{"Conflict<br/>handling"}
    G -->|Merge automatically| H["CRDTs<br/>Riak, Redis CRDT"]
    G -->|Application resolves| I["Vector clocks<br/>DynamoDB streams"]
    B -->|No ordering needed| J["Eventual consistency<br/>DNS, CDN caches"]
```

Conflict Resolution Strategies
When a system chooses availability over consistency (AP in CAP terms), concurrent writes to different replicas create conflicts. Three strategies handle them.
Last-Writer-Wins (LWW)
The simplest approach: attach a timestamp to every write, and when replicas merge, the highest timestamp wins. DynamoDB, Cassandra, and most AP databases use LWW as the default conflict resolution strategy.
LWW has a critical failure mode: it silently drops writes. If two users update the same record within the clock resolution window, one update vanishes. No error. No notification. The "losing" write simply ceases to exist. For counters, accumulators, or any operation where both writes carry meaningful state, LWW destroys data.
I once debugged a system where inventory counts were drifting downward over weeks. Two microservices were updating the same DynamoDB item with stock counts. Whichever write arrived last "won" and overwrote the other's count. The fix was switching to DynamoDB's atomic counter operations, which bypass LWW entirely by using ADD expressions that the storage layer applies sequentially.
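The failure mode reproduces in a few lines. This toy simulation (names are illustrative) shows two writers each computing a new total from the same stale read: LWW keeps only one decrement, while a sequentially applied atomic add keeps both.

```python
# Sketch: how last-writer-wins loses concurrent updates. Two services
# read stock=10, each sells one unit, and each writes back its own
# computed total. LWW keeps only the later write.

def lww_merge(a, b):
    # (timestamp, value) pairs; the highest timestamp wins.
    return a if a[0] >= b[0] else b

stock = (0, 10)
write_a = (1, stock[1] - 1)   # service A computes 10 - 1 = 9
write_b = (2, stock[1] - 1)   # service B also computes 9, slightly later
merged = lww_merge(write_a, write_b)
assert merged[1] == 9          # two units sold, but stock only fell by one

# With atomic adds applied sequentially by the storage layer, both survive:
stock_value = 10
for delta in (-1, -1):
    stock_value += delta
assert stock_value == 8
```

The second half is the shape of the fix described above: express the mutation as a delta the store applies in order, not as a computed total the store overwrites.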
Vector Clocks
Vector clocks track causal history. Each node maintains a vector of counters, one per node. When a node writes, it increments its own counter. When nodes synchronize, they merge vectors by taking the element-wise maximum. Two writes are concurrent if neither vector dominates the other.
When concurrent writes are detected, the system can either merge them automatically (if the data type supports it) or surface the conflict to the application for resolution. Amazon's original Dynamo paper described this approach: the shopping cart service presented conflicting versions to the application layer, which merged them by taking the union of cart items.
Vector clocks are correct but they grow linearly with the number of nodes. In a system with thousands of nodes, the metadata overhead per operation becomes significant. Dotted version vectors and other compressed representations address this, but they add implementation complexity.
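The dominance test and merge are mechanical. A minimal sketch (function names are my own):

```python
# Sketch: vector clock comparison and merge. Two events are concurrent
# when neither clock dominates the other element-wise.

def merge(vc_a, vc_b):
    return [max(a, b) for a, b in zip(vc_a, vc_b)]

def compare(vc_a, vc_b):
    a_le = all(a <= b for a, b in zip(vc_a, vc_b))
    b_le = all(b <= a for a, b in zip(vc_a, vc_b))
    if a_le and b_le:
        return "equal"
    if a_le:
        return "a happened before b"
    if b_le:
        return "b happened before a"
    return "concurrent"   # conflict: merge or surface to the application
```

A "concurrent" result is exactly the case Dynamo's shopping cart handled by presenting both versions to the application for a union merge.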
Conflict-Free Replicated Data Types (CRDTs)
CRDTs represent the most sophisticated approach to conflict resolution. A CRDT is a data structure that is mathematically guaranteed to converge to the same state on all replicas, regardless of the order in which operations are applied. No coordination. No conflict resolution logic. Convergence is a property of the data type itself.
Common CRDT types:
| CRDT | Operation | Merge Strategy | Use Case |
|---|---|---|---|
| G-Counter | Increment only | Sum per-node counts | Page views, event counters |
| PN-Counter | Increment and decrement | Separate G-Counters for adds and removes | Like/dislike counts, stock levels |
| G-Set | Add only | Union | Tag collections, feature flags |
| OR-Set | Add and remove | Track add/remove operations with unique tags | Shopping carts, user preferences |
| LWW-Register | Overwrite | Highest timestamp wins | Last-updated fields, status flags |
| MV-Register | Overwrite | Preserve all concurrent values | Collaborative editing |
CRDTs power Redis's CRDT module, Riak's data types, and Azure Cosmos DB's conflict resolution. Riot Games uses Riak CRDTs for League of Legends' in-game chat, handling 7.5 million concurrent users and 11,000 messages per second. The OR-Set CRDT tracks chat room membership; players joining and leaving across different server nodes converge without coordination.
The limitation: CRDTs work only for data types where a commutative, associative, idempotent merge function exists. Not everything fits. You cannot build a CRDT for a bank account balance that supports arbitrary debits (what happens if two concurrent debits overdraw the account?). For those cases, you need coordination, which means consensus, which means latency.
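A G-Counter makes the convergence property concrete. In this sketch (illustrative, not any library's API), each node increments only its own slot and merge takes the element-wise maximum, which is commutative, associative, and idempotent:

```python
# Sketch: a G-Counter CRDT. Each node increments only its own slot;
# merge takes the element-wise max, so replicas converge to the same
# value regardless of merge order or repetition.

class GCounter:
    def __init__(self, node_id, num_nodes):
        self.node_id = node_id
        self.counts = [0] * num_nodes

    def increment(self, by=1):
        self.counts[self.node_id] += by

    def value(self):
        return sum(self.counts)

    def merge(self, other):
        self.counts = [max(a, b)
                       for a, b in zip(self.counts, other.counts)]
```

Two replicas that increment independently and then merge (in either order, any number of times) agree on the total without any coordination, which is the whole point.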
How AWS Services Map to These Models
Understanding the consistency model of every AWS service you depend on prevents the most common class of distributed systems bugs.
DynamoDB
DynamoDB defaults to eventual consistency. Strongly consistent reads are available per-request by setting ConsistentRead: true, at twice the cost (and with the limitation that strongly consistent reads only work against the primary node in a partition's replication group).
Global tables replicate across regions with eventual consistency and use LWW conflict resolution. If two regions write the same item within the replication window (typically under one second), the write with the later timestamp wins. There is no multi-region strong consistency option; if you need it, you need a single-region deployment or an application-level coordination layer.
DynamoDB Streams provide ordered change capture within a shard, giving you causal ordering for single-partition operations. Cross-partition ordering requires application-level sequencing.
For a deeper look at DynamoDB's architecture, see AWS DynamoDB: An Architecture Deep-Dive.
Aurora
Single-region Aurora provides strong consistency through a quorum-based storage layer. Writes go to a single primary instance and replicate synchronously to a quorum of storage nodes across three AZs. Read replicas serve eventually consistent reads with replication lag typically under 100 milliseconds (and often under 20 milliseconds because replicas share the same storage volume).
Aurora Global Database replicates asynchronously across regions with typical lag under one second. Write forwarding allows secondary regions to forward writes to the primary region, adding cross-region latency to every write. There is no cross-region strong consistency; the secondary regions are always eventually consistent with the primary.
For Aurora architecture details, see MySQL vs. PostgreSQL on Aurora: An Architecture Deep Dive and AWS Aurora: Getting Close to Multi-Region Active/Active.
S3
S3 delivered eventual consistency from its launch in 2006 until December 2020, when Amazon shipped strong read-after-write consistency for all operations at no additional cost and with no performance penalty. Every S3 GET, PUT, LIST, and metadata operation now reflects the most recent write.
The engineering behind this change is remarkable. Amazon built strong consistency into a system handling tens of trillions of objects and millions of requests per second without degrading performance. The lesson: eventual consistency was a design choice, not a physical constraint. Amazon chose to eliminate it when the engineering became feasible.
ElastiCache
Redis clusters provide eventual consistency across replicas. Writes go to the primary; replicas sync asynchronously. During a primary failover, the promoted replica may be behind the old primary, resulting in lost writes. Redis Cluster uses LWW and last-failover-wins semantics, which means acknowledged writes can vanish during failovers.
For production workloads where write durability matters, configure min-replicas-to-write to require acknowledgment from at least one replica before confirming a write. This does not provide strong consistency, but it reduces the window for data loss during failovers.
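The relevant directives, as a redis.conf fragment (the lag bound here is illustrative; tune it to your observed replication behavior):

```conf
# Refuse writes unless at least 1 replica is connected
# and lagging by no more than 10 seconds.
min-replicas-to-write 1
min-replicas-max-lag 10
```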
See Amazon ElastiCache: An Architecture Deep-Dive for the full ElastiCache architecture.
Choosing a Consistency Model
The Decision Framework
The right consistency model depends on three factors: what your application can tolerate, how much latency your users can tolerate, and how wide your geographic distribution needs to be.
| Requirement | Recommended Model | AWS Service Example |
|---|---|---|
| Financial transactions, inventory management | Linearizability / Strong consistency | Aurora (single-region), DynamoDB (strongly consistent reads) |
| User-facing reads where staleness is visible | Read-your-writes, monotonic reads | DynamoDB (consistent reads for the writing user, eventual for others) |
| Multi-region with sub-100ms reads | Eventual consistency | DynamoDB Global Tables, Aurora Global Database, ElastiCache Global Datastore |
| Collaborative editing | Causal consistency | Custom implementation with CRDTs or vector clocks |
| Analytics, metrics, dashboards | Eventual consistency | S3, Athena, OpenSearch |
| Distributed locks, leader election | Linearizability | DynamoDB (conditional writes), ElastiCache (with Redlock caveats) |
The Three Questions
Before choosing a consistency model for any new service, I ask three questions:
1. What happens if a user reads stale data? If the answer is "nothing bad" (metrics dashboards, recommendation engines, content feeds), use eventual consistency and take the performance win. If the answer is "they see incorrect account balances" or "two users buy the last item in stock," you need strong consistency on the critical path.
2. How far apart are your writers? Strong consistency across regions requires synchronous cross-region replication. At 70 milliseconds of round-trip latency between US-East and EU-West, every strongly consistent write adds at least 70 milliseconds. If your writers are in multiple regions, either accept eventual consistency across regions or route all writes through a single region and tolerate the latency.
3. What is your conflict resolution strategy? If concurrent writes are possible (and in any multi-writer system, they are), you need a plan. LWW is acceptable for data where the latest value is always the correct value (user profile updates, session data). For counters, sets, or any accumulating data, use atomic operations or CRDTs. For complex business logic (order fulfillment, inventory reservation), use conditional writes or optimistic concurrency control.
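The optimistic concurrency pattern from question 3 fits in a dozen lines. This in-memory sketch (class and method names are hypothetical) expresses the same check a DynamoDB conditional write performs: succeed only if the version has not changed since the read.

```python
# Sketch: optimistic concurrency control with a version check. A write
# succeeds only if the item's version is unchanged since the caller
# read it; a losing writer must re-read and retry.

class VersionedStore:
    def __init__(self):
        self.items = {}   # key -> (version, value)

    def get(self, key):
        return self.items.get(key, (0, None))

    def put_if_version(self, key, expected_version, value):
        current_version, _ = self.get(key)
        if current_version != expected_version:
            return False  # conflict detected: no silent overwrite
        self.items[key] = (current_version + 1, value)
        return True
```

Unlike LWW, a conflicting write fails loudly instead of silently discarding someone else's update, which is what order fulfillment and inventory reservation need.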
Key Patterns
CAP is a starting point, not a framework. Use PACELC for actual database selection. The latency-consistency trade-off during normal operations matters far more than the availability-consistency trade-off during rare partition events.
Match consistency to the operation, not the system. DynamoDB lets you choose consistency per read. Use strongly consistent reads for the operations that need them and eventual consistency everywhere else. The same principle applies at the architecture level: use a strongly consistent database for your transaction log and an eventually consistent cache for your read layer.
Quorum math is necessary but not sufficient. R + W > N guarantees read freshness. It does not guarantee linearizability, total ordering, or protection against concurrent write conflicts. Layer a consensus protocol on top if you need those guarantees.
CRDTs are underused. Most AP systems default to LWW because it is simple. For counters, sets, and registers, CRDTs provide strong eventual consistency (guaranteed convergence without coordination) at the cost of some metadata overhead. If your application can express its state mutations as CRDT operations, you get the performance of AP with better correctness guarantees than LWW.
Test your consistency assumptions. Jepsen (jepsen.io) has tested every major distributed database and found consistency violations in nearly all of them. CockroachDB, MongoDB, Cassandra, Redis, PostgreSQL: all have had documented consistency bugs. The formal model and the implementation are different things. Run your own failure injection tests under load.
S3 proved eventual consistency is often a choice, not a constraint. Amazon eliminated S3's eventual consistency with no performance penalty. When a vendor tells you eventual consistency is a necessary trade-off for performance, ask whether that is a fundamental limitation or an engineering priority decision.
Additional Resources
- Eric Brewer's original CAP conjecture (PODC 2000)
- Gilbert and Lynch's CAP proof (2002)
- Daniel Abadi's PACELC paper: Consistency Tradeoffs in Modern Distributed Database System Design
- Aphyr's guide to strong consistency models
- Jepsen: Distributed Systems Safety Research
- Amazon S3 Strong Read-After-Write Consistency announcement
- CockroachDB: Living Without Atomic Clocks
- Spanner: TrueTime and External Consistency
- Amazon DynamoDB Read Consistency documentation
- Google SRE Book: Managing Critical State (Distributed Consensus)
Let's Build Something!
I help teams ship cloud infrastructure that actually works at scale. Whether you're modernizing a legacy platform, designing a multi-region architecture from scratch, or figuring out how AI fits into your engineering workflow, I've seen your problem before. Let me help.
Currently taking on select consulting engagements through Vantalect.

