AWS Aurora: Getting Close to Multi-Region Active/Active

About the author: I'm Charles Sieg, a cloud architect and platform engineer who builds apps, services, and infrastructure for Fortune 1000 clients through Vantalect. If your organization is rethinking its software strategy in the age of AI-assisted engineering, let's talk.

Every production architecture conversation I've had in the last five years eventually lands on the same question: can we go active/active across regions? The answer with Aurora has historically been "sort of, with significant caveats." Aurora Global Database gives you cross-region reads and fast failover. Write forwarding lets secondary regions send writes to the primary. Aurora DSQL promises genuine multi-region active/active with strong consistency. Each of these represents a different point on the spectrum between "one region writes, everyone else reads" and "any region writes, strong consistency everywhere." I've deployed all of them. The operational reality of each is more nuanced than the marketing suggests.

This is an architecture reference for engineers evaluating multi-region database strategies on AWS. It covers Aurora's storage internals (because they constrain every multi-region option), each approach to multi-region writes, and the trade-offs that determine which pattern fits your workload. If you need a getting-started guide for Aurora, look elsewhere. What follows assumes you already run Aurora in production and want to understand the multi-region landscape.

How Aurora Storage Works Under the Hood

Every multi-region Aurora architecture builds on the same storage engine. Understanding it explains why Aurora's multi-region story looks the way it does.

The Log-Structured Storage Layer

Aurora's central architectural insight: decouple compute from storage, and only ship log records across the network. Traditional MySQL and PostgreSQL replicate full data pages between primary and replicas. Aurora replicas share a single distributed storage volume and only replicate redo log records. The storage layer assembles data pages from the log records on demand.

This design reduces network I/O by roughly 7x compared to traditional MySQL replication. The database engine sends only redo log entries to storage nodes. Those storage nodes apply the log records to their local copies of data pages asynchronously. The compute layer never writes data pages to storage.

Quorum Writes and Protection Groups

Aurora divides each database volume into 10 GB chunks called protection groups. Each protection group replicates across six storage nodes spanning three Availability Zones (two nodes per AZ):

| Component | Value |
|---|---|
| Protection group size | 10 GB |
| Total replicas per protection group | 6 |
| AZs per protection group | 3 (2 nodes each) |
| Write quorum | 4 of 6 |
| Read quorum | 3 of 6 |
| Tolerates AZ failure | Yes (lose 2 nodes, retain 4) |
| Tolerates AZ failure + 1 node | Yes (lose 3 nodes, retain 3 for reads) |
| Max volume size | 128 TiB |

Writes succeed when four of six storage nodes acknowledge. Reads succeed with three of six. This means Aurora tolerates losing an entire AZ (two nodes) and still processes both reads and writes. Lose an AZ plus one additional node, and you still serve reads. The quorum math is the reason Aurora advertises 99.99% availability for multi-AZ deployments.
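The quorum arithmetic can be sketched in a few lines (illustrative only; the function and constant names are mine, not an AWS API):

```python
# Sketch of Aurora's 4-of-6 write / 3-of-6 read quorum math.
# A protection group has 6 replicas: 2 in each of 3 AZs.

TOTAL_REPLICAS = 6
WRITE_QUORUM = 4
READ_QUORUM = 3

def availability(failed_nodes: int) -> dict:
    """Return which operations survive a given number of failed replicas."""
    healthy = TOTAL_REPLICAS - failed_nodes
    return {
        "writes": healthy >= WRITE_QUORUM,
        "reads": healthy >= READ_QUORUM,
    }

# Losing an entire AZ costs 2 nodes: reads and writes both survive.
print(availability(2))  # {'writes': True, 'reads': True}
# AZ loss plus one more node: reads survive, writes do not.
print(availability(3))  # {'writes': False, 'reads': True}
```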

The storage nodes themselves use a mix of full segments (data pages plus log records) and tail segments (log records only). Each AZ holds one full segment and one tail segment per protection group. This reduces storage costs without sacrificing durability.

Why Single-Writer Matters

Aurora's storage engine relies on a monotonically increasing Log Sequence Number (LSN) to order all writes. A single writer instance generates this LSN sequence. Every storage node, every replica, and every read consistency check depends on this ordered log. Introducing a second writer would require distributed consensus on every write to maintain a globally ordered log, and that would destroy the latency advantage that makes Aurora fast.

This is the fundamental constraint that shapes every Aurora multi-region strategy. The storage layer is designed around a single ordered write stream. Every approach to multi-region writes works around this constraint, and each workaround carries trade-offs.

Aurora Global Database

Aurora Global Database replicates an Aurora cluster across up to six AWS regions. One region is primary (reads and writes). Up to five secondary regions serve reads.

Storage-Level Replication

Global Database replicates at the storage layer, not the database engine layer. The primary region's storage nodes stream redo log records directly to secondary region storage, bypassing the database engine entirely. This is faster and lighter than traditional logical replication:

| Metric | Aurora Global Database | Traditional Cross-Region Replication |
|---|---|---|
| Replication mechanism | Storage-level redo log streaming | Logical replication (engine layer) |
| Typical replication lag | < 1 second | 1-30 seconds |
| Impact on primary performance | Minimal (dedicated replication infrastructure) | Moderate (engine does extra work) |
| Secondary region capability | Read-only (or write forwarding) | Read-only |
| Failover RPO | Seconds (typically < 1s data loss) | Minutes |
| Switchover RPO | 0 (synchronizes before switching) | Depends on replication lag |

Failover Mechanics

Global Database supports two recovery operations:

Switchover (planned). Demotes the primary, promotes a secondary. RPO is zero because Aurora synchronizes all data before switching roles. Typical completion time: 1-2 minutes. Use this for planned region migrations, maintenance windows, and DR testing.

Failover (unplanned). Promotes a secondary immediately without synchronizing. RPO depends on replication lag at the moment of failure (typically under 1 second of data loss). Completion time: under 1 minute. Use this during regional outages.

Note
Test your failover. I've seen teams configure Global Database and never validate that their application actually reconnects to the new primary. DNS caching, connection pooling, and application-level retries all need testing. A failover that works on paper and fails in practice provides zero value.
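Both operations are one API call away, which is exactly why they belong in a scheduled DR drill. A hedged sketch using boto3 (the cluster identifiers are placeholders; verify parameters against the current boto3 RDS documentation before relying on this):

```python
# Planned switchover (RPO 0) vs. unplanned failover (RPO = replication lag).
# The client is passed in so the functions can be exercised without AWS access.

def planned_switchover(rds, global_cluster_id: str, target_cluster_arn: str):
    """Zero-RPO role swap: Aurora synchronizes before switching. Use for drills."""
    return rds.switchover_global_cluster(
        GlobalClusterIdentifier=global_cluster_id,
        TargetDbClusterIdentifier=target_cluster_arn,
    )

def unplanned_failover(rds, global_cluster_id: str, target_cluster_arn: str):
    """Immediate promotion: data committed but not yet replicated is lost."""
    return rds.failover_global_cluster(
        GlobalClusterIdentifier=global_cluster_id,
        TargetDbClusterIdentifier=target_cluster_arn,
    )

# Usage (requires AWS credentials):
#   import boto3
#   rds = boto3.client("rds")
#   planned_switchover(rds, "my-global-cluster",
#                      "arn:aws:rds:eu-west-1:123456789012:cluster:my-secondary")
```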

The Active-Passive Reality

Global Database without write forwarding is strictly active-passive. One region handles all writes. Secondary regions serve reads with sub-second lag. If your application can tolerate read-only secondaries and rare failover events, this is the simplest multi-region Aurora architecture. Operationally, it works well. The replication lag is low enough that most read workloads never notice the inconsistency.

The problem: any write from a user in a secondary region must traverse the network to the primary region. If your users are in Tokyo and your primary is in Virginia, every write incurs 150-200ms of network latency on top of the database operation. For read-heavy workloads (and most web applications are read-heavy), this is acceptable. For write-intensive workloads distributed globally, it becomes a bottleneck.

Write Forwarding: The Active-Active Approximation

Write forwarding is Aurora's attempt to make Global Database feel active/active without changing the single-writer storage architecture.

How It Works

With write forwarding enabled, applications connect to a secondary cluster's reader endpoint and issue both reads and writes. The secondary cluster serves reads locally (fast, sub-second lag from primary). For writes, the secondary transparently forwards the SQL statement to the primary region's writer instance. The primary executes the write, commits it to storage, and the change replicates back to the secondary through normal storage-level replication.

```mermaid
flowchart LR
    subgraph US_EAST["US East (Primary Region)"]
        W[Writer Instance]
        S1[Storage Volume]
    end
    subgraph EU_WEST["EU West (Secondary Region)"]
        R[Reader Instance<br/>+ Write Forwarding]
        S2[Storage Volume]
    end
    App_US[US Application] -->|reads + writes| W
    App_EU[EU Application] -->|reads local| R
    App_EU -->|writes forwarded| R
    R -->|forward write SQL| W
    W --> S1
    S1 -->|redo log stream<br/>sub-second| S2
    S2 --> R
```

Aurora Global Database with write forwarding

The application connects to the secondary region and issues normal SQL. Aurora handles the forwarding transparently. No application-level write routing required.

Consistency Levels

Write forwarding supports three consistency modes, configured per session:

| Consistency Level | Behavior | Latency Impact | Use Case |
|---|---|---|---|
| EVENTUAL | Writes forward to primary; subsequent reads may not reflect the write immediately | Lowest | Analytics dashboards, non-critical reads |
| SESSION | Within the same session, reads after a forwarded write reflect that write | Moderate (waits for replication) | Most application workloads |
| GLOBAL | All sessions on the secondary see the write immediately after it commits | Highest (waits for full replication) | Multi-user collaborative workflows |

SESSION consistency is the right default for most applications. It guarantees read-your-own-writes within a single database connection. GLOBAL consistency blocks until the write propagates fully to the secondary storage, which means you pay the full round-trip replication latency on every write.
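The consistency mode is a session variable, and the variable name differs by engine. A small sketch that builds the `SET` statement (the parameter names `aurora_replica_read_consistency` for Aurora MySQL and `apg_write_forward.consistency_mode` for Aurora PostgreSQL are from my reading of the AWS docs — verify against current documentation for your engine version):

```python
# Build the per-session SET statement for a write-forwarding consistency mode.
# Assumed parameter names (verify): aurora_replica_read_consistency (MySQL),
# apg_write_forward.consistency_mode (PostgreSQL).

ALLOWED_LEVELS = {"eventual", "session", "global"}

def consistency_statement(engine: str, level: str) -> str:
    """Return the SQL to set write-forwarding consistency for this session."""
    level = level.lower()
    if level not in ALLOWED_LEVELS:
        raise ValueError(f"unknown consistency level: {level}")
    if engine == "mysql":
        return f"SET aurora_replica_read_consistency = '{level}';"
    if engine == "postgresql":
        return f"SET apg_write_forward.consistency_mode = '{level}';"
    raise ValueError(f"unknown engine: {engine}")

print(consistency_statement("mysql", "SESSION"))
# SET aurora_replica_read_consistency = 'session';
```

Run the statement once per connection (or in your pool's connection-init hook) before issuing forwarded writes.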

Latency and Performance Reality

Write forwarding adds latency to every write. The forwarded write must travel from the secondary region to the primary, execute, commit, and (depending on consistency level) replicate back before the application sees a response:

| Component | Typical Latency |
|---|---|
| Network round-trip (cross-region) | 50-200ms depending on regions |
| Write execution on primary | 1-10ms |
| Replication back to secondary | < 1 second |
| Total (EVENTUAL consistency) | 50-210ms |
| Total (SESSION consistency) | 100-400ms |
| Total (GLOBAL consistency) | 150-600ms |

Compare this to a direct write on the primary: 1-10ms. Write forwarding turns a single-digit millisecond operation into a multi-hundred millisecond operation. For infrequent writes from secondary regions, this is fine. For write-heavy workloads, it becomes a performance bottleneck.
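A crude back-of-envelope model of those components (illustrative arithmetic mirroring the table above, not a measurement; the GLOBAL penalty factor is my simplifying assumption):

```python
# Rough model: forwarded-write latency = cross-region RTT + execution,
# plus replication wait depending on consistency level.

def forwarded_write_latency_ms(rtt_ms: float, exec_ms: float, consistency: str,
                               replication_ms: float = 200.0) -> float:
    """Estimate end-to-end latency for a write forwarded from a secondary region."""
    base = rtt_ms + exec_ms
    if consistency == "EVENTUAL":
        return base                       # ack once the primary commits
    if consistency == "SESSION":
        return base + replication_ms      # wait until this session can read it back
    if consistency == "GLOBAL":
        return base + 2 * replication_ms  # pessimistic: wait for full propagation
    raise ValueError(consistency)

# Tokyo -> Virginia, ~150ms RTT, 5ms execution:
print(forwarded_write_latency_ms(150, 5, "EVENTUAL"))  # 155.0
print(forwarded_write_latency_ms(150, 5, "SESSION"))   # 355.0
```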

When Write Forwarding Breaks Down

AWS explicitly states they do not recommend write forwarding for active-active application processing across regions. That caveat is earned. Scenarios where write forwarding fails to deliver:

  • High write volume from secondary regions. Every write saturates the cross-region link and loads the primary writer. If 50% of your writes originate from secondary regions, you've doubled the primary's write load plus added cross-region latency to half your transactions.
  • Latency-sensitive writes. Any workflow where users wait for a write to complete (form submissions, checkout, real-time updates) will feel slow from secondary regions.
  • Transaction-heavy workloads. Multi-statement transactions hold resources on the primary for the duration of the cross-region round trip per statement. Long transactions from remote regions increase lock contention on the primary.
  • DDL statements. Schema changes cannot be forwarded. They must run on the primary directly.

Write forwarding works well for a specific pattern: applications that are overwhelmingly read-heavy (90%+ reads), with occasional writes from secondary regions that can tolerate higher latency. Content management systems, product catalogs with regional editorial teams, and configuration dashboards fit this profile.

Aurora Limitless Database: Horizontal Write Scaling

Aurora Limitless Database (GA for PostgreSQL) addresses a different constraint: scaling writes beyond a single instance within a single region.

Router and Shard Architecture

Limitless Database introduces a two-tier architecture: routers and shards. Routers accept client connections, parse SQL, determine which shards hold the relevant data, and aggregate results. Shards are Aurora PostgreSQL instances that each store a subset of the data.

| Component | Function | Client-Accessible |
|---|---|---|
| Router | Accepts connections, routes queries, aggregates results, manages distributed transactions | Yes (via DB cluster endpoint) |
| Shard | Stores data subset, executes local queries | No (routers only) |

Table Types

| Table Type | Data Location | Write Volume | Join Performance | Use Case |
|---|---|---|---|---|
| Sharded | Distributed by hash of shard key | High | Best when joining on shard key (co-located) | Large, high-throughput tables |
| Reference | Full copy on every shard | Low to moderate | Fast (always local) | Small lookup tables, dimensions |
| Standard | Single system-selected shard | Any | Fast within standard tables | Legacy tables, small datasets |
| Co-located sharded | Same shard as parent table | High | Fast (same shard as parent) | Child tables in parent-child relationships |

Sharding is hash-based only. No range or list partitioning. Choose your shard key carefully: all data with the same shard key value lands on the same shard. Cross-shard queries work (the router coordinates them) but cost more in latency and resources than single-shard queries.
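The key property of hash placement is easy to demonstrate (illustrative only; Limitless uses its own internal hash function, this just shows the behavior):

```python
# Hash-based shard placement: same key -> same shard, every time.
import hashlib

def shard_for(key: str, num_shards: int) -> int:
    """Deterministically map a shard key value to a shard number."""
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

# All rows sharing a shard key land on one shard, so joins on that
# key (e.g. customer_id) stay local to a single shard.
assert shard_for("customer-42", 16) == shard_for("customer-42", 16)

# Distinct keys scatter across shards; a query spanning them is
# cross-shard and pays router coordination overhead.
shards_hit = {shard_for(f"customer-{i}", 16) for i in range(1000)}
print(len(shards_hit))  # keys spread across (nearly) all 16 shards
```

Note there is no range affinity: `customer-42` and `customer-43` hash to unrelated shards, which is why range scans over the shard key are inherently cross-shard.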

Where Limitless Fits

Limitless solves write throughput scaling within a single region. It does not solve multi-region active/active. You can combine Limitless with Global Database for cross-region reads, but writes still flow to a single primary region. Think of Limitless as vertical scaling (more write capacity) combined with horizontal data distribution, while Global Database handles geographic distribution.

Aurora DSQL: True Multi-Region Active/Active

Aurora DSQL is AWS's answer to Google Spanner and CockroachDB. Launched in preview at re:Invent 2024, with GA in mid-2025, DSQL provides genuine active/active writes across regions with strong consistency.

Disaggregated Architecture

DSQL decomposes the database into independently scaling components:

| Component | Function |
|---|---|
| Query Processor | Parses SQL, builds execution plans, executes queries |
| Adjudicator | Resolves transaction conflicts using optimistic concurrency control |
| Journal | Distributed transaction log, synchronously replicated across AZs |
| Crossbar | Routes data between components |
| Storage Replicas | Receive committed log data, serve reads |

Each component scales independently. The query processor scales with query volume. Storage replicas scale with data size. The journal scales with write throughput. This disaggregation is fundamentally different from traditional Aurora, where compute and storage scale together (albeit separately from each other).

Multi-Region with Witness

DSQL's multi-region configuration uses two active regions plus a witness:

```mermaid
flowchart TD
    subgraph R1["Region 1 (Active)"]
        QP1[Query Processor]
        J1[Journal]
        S1[Storage Replicas]
        QP1 --> J1
        J1 --> S1
    end
    subgraph R2["Region 2 (Active)"]
        QP2[Query Processor]
        J2[Journal]
        S2[Storage Replicas]
        QP2 --> J2
        J2 --> S2
    end
    subgraph W["Region 3 (Witness)"]
        WJ[Transaction Log<br/>Encrypted]
    end
    App1[Application] --> QP1
    App2[Application] --> QP2
    J1 <-->|synchronous replication| J2
    J1 -->|log records| WJ
    J2 -->|log records| WJ
```

Aurora DSQL multi-region architecture

Both active regions accept reads and writes. The witness region stores encrypted transaction logs and participates in quorum decisions but does not serve application traffic. If one active region fails, the other continues operating with the witness providing quorum.

Optimistic Concurrency Control

DSQL uses optimistic concurrency control (OCC) instead of traditional locking. Transactions execute without acquiring locks. At commit time, the adjudicator checks for conflicts. If two transactions modified the same data, the later one aborts and retries.

OCC works well when write conflicts are rare (different users modifying different data). It performs poorly when many transactions contend for the same rows. For a multi-tenant application where each tenant's data is independent, OCC is excellent. For a single counter or a shared inventory record, conflicts cause frequent aborts.
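A minimal simulation of the commit-time check (illustrative of the version-validation idea behind the adjudicator, not DSQL code):

```python
# Optimistic concurrency control: read a version, work without locks,
# commit only if the version is unchanged at commit time.

class Record:
    def __init__(self, value):
        self.value = value
        self.version = 0

def occ_commit(record: Record, read_version: int, new_value) -> bool:
    """Commit succeeds only if nothing committed since our read; else abort."""
    if record.version != read_version:
        return False  # conflict detected: caller retries with a fresh read
    record.value = new_value
    record.version += 1
    return True

r = Record(10)
v = r.version                     # transactions A and B both read version 0
assert occ_commit(r, v, 11)       # A commits first and bumps the version
assert not occ_commit(r, v, 12)   # B aborts: its read is stale, must retry
```

This is why contention matters so much: every abort is wasted work plus a retry, so hot rows turn OCC's lock-free optimism into a throughput tax.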

Availability Targets

| Configuration | Availability SLA |
|---|---|
| Single-region DSQL | 99.99% |
| Multi-region DSQL (linked clusters) | 99.999% |

99.999% availability means roughly 5 minutes of downtime per year. That is a stronger guarantee than any other Aurora configuration.
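The downtime arithmetic, for reference:

```python
# Convert an availability target into allowed downtime per year.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def downtime_minutes(availability: float) -> float:
    return (1 - availability) * MINUTES_PER_YEAR

print(round(downtime_minutes(0.9999), 1))   # 52.6 min/year at 99.99%
print(round(downtime_minutes(0.99999), 1))  # 5.3 min/year at 99.999%
```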

Limitations and Trade-offs

DSQL achieves its distributed architecture by dropping PostgreSQL features that assume a single-node database:

| Feature | DSQL Support |
|---|---|
| PostgreSQL wire protocol | Yes |
| Standard SQL queries | Yes (with restrictions) |
| Foreign keys | No |
| Temporary tables | No |
| Views | No |
| Stored procedures | Limited |
| Triggers | No |
| Max rows per transaction | 10,000 |
| Sequences | No (use UUIDs) |
| Extensions | Very limited |

These limitations are significant. If your application relies on foreign key enforcement, stored procedures, triggers, or views, DSQL requires substantial refactoring. For greenfield applications designed around DSQL's constraints, the trade-off can be worth it. For existing PostgreSQL applications, migration is a major effort.

AWS claims DSQL delivers reads and writes four times faster than Google Spanner. Whether that holds under your specific workload requires benchmarking, but the architectural approach (disaggregated components, OCC, local reads) favors low-latency operations when conflicts are rare.

The Multi-Region Decision Matrix

Choosing the right Aurora multi-region strategy depends on your write patterns, latency requirements, consistency needs, and tolerance for operational complexity.

| Criteria | Global Database (Active-Passive) | Global DB + Write Forwarding | Aurora DSQL | CockroachDB / Spanner |
|---|---|---|---|---|
| Write regions | 1 | 1 (writes forwarded) | 2 (+ witness) | Any number |
| Read regions | Up to 6 | Up to 6 | 2 | Any number |
| Write latency from secondary | N/A (writes go to primary) | 50-600ms (forwarded) | Single-digit ms (local) | Single-digit ms (local) |
| Consistency | Strong in primary, eventual in secondary | Configurable (EVENTUAL/SESSION/GLOBAL) | Strong snapshot isolation | Serializable |
| Failover time | < 1 minute | < 1 minute | Automatic (witness quorum) | Automatic |
| PostgreSQL compatibility | Full | Full | Limited (no FK, views, triggers) | CockroachDB: high; Spanner: moderate |
| Operational complexity | Low | Low | Low (serverless) | Moderate to high |
| Vendor lock-in | High | High | Very high | CockroachDB: low; Spanner: high |
| Regional availability | All Aurora regions | All Aurora regions | Limited (US regions) | Global |

When to Use Each Approach

Global Database (active-passive): Your workload is read-heavy, writes originate from a single region, and you need cross-region DR with fast failover. This covers the majority of production workloads. Pair it with a load balancer (AWS Elastic Load Balancing: An Architecture Deep-Dive) that routes reads to the nearest region and writes to the primary.

Global Database + write forwarding: You need occasional writes from secondary regions and can tolerate the added latency. Content management systems, admin dashboards, and configuration tools where writes are infrequent and human-initiated. Do not use this for high-volume write workloads from secondary regions.

Aurora DSQL: You need genuine active/active writes across two regions with strong consistency, and your application can work within DSQL's PostgreSQL subset. Greenfield applications targeting 99.999% availability. Evaluate the feature limitations carefully before committing.

CockroachDB or Spanner: You need active/active across more than two regions, require full SQL support, or want to avoid AWS vendor lock-in. These come with higher operational complexity (CockroachDB) or Google lock-in (Spanner) and higher write latencies due to distributed consensus.

Failure Modes and Operational Lessons

Replication Lag Spikes

Aurora Global Database replication lag is typically under 1 second. During sustained heavy write bursts on the primary, lag can spike to multiple seconds. I've seen lag reach 5-10 seconds during large data migrations running on the primary. If your secondary-region application uses SESSION or GLOBAL consistency with write forwarding during a lag spike, write latency in the secondary region spikes proportionally.

Monitor AuroraGlobalDBReplicationLag in CloudWatch. Set alarms at 2 seconds. If your application has strict read-after-write requirements from secondary regions, lag spikes above your tolerance threshold should trigger application-level degradation (direct writes to primary region).
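A hedged boto3 sketch of that alarm (the alarm and cluster names are placeholders; `AuroraGlobalDBReplicationLag` is reported in milliseconds, so the 2-second threshold is 2000 — verify metric names and dimensions against current AWS documentation):

```python
# Create a CloudWatch alarm on Global Database replication lag.
# The client is passed in so the function can be exercised without AWS access.

def create_lag_alarm(cloudwatch, cluster_id: str, threshold_ms: float = 2000.0):
    """Alarm when replication lag exceeds threshold_ms for two consecutive minutes."""
    return cloudwatch.put_metric_alarm(
        AlarmName=f"{cluster_id}-global-replication-lag",   # placeholder naming
        Namespace="AWS/RDS",
        MetricName="AuroraGlobalDBReplicationLag",
        Dimensions=[{"Name": "DBClusterIdentifier", "Value": cluster_id}],
        Statistic="Maximum",
        Period=60,
        EvaluationPeriods=2,
        Threshold=threshold_ms,
        ComparisonOperator="GreaterThanThreshold",
    )

# Usage (requires AWS credentials):
#   import boto3
#   create_lag_alarm(boto3.client("cloudwatch"), "my-secondary-cluster")
```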

Write Forwarding Timeouts

Forwarded writes cross the inter-region network (AWS's backbone). Network partitions, even brief ones, cause forwarded writes to time out. Your application needs retry logic with exponential backoff. The default write forwarding timeout is 60 seconds. For latency-sensitive workflows, reduce this and fail fast.
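The retry logic above can be sketched generically (tune attempts, base delay, and cap for your workload; the function names are mine):

```python
# Retry with exponential backoff plus jitter, failing fast after a cap.
import random
import time

def retry_with_backoff(op, max_attempts=4, base_delay=0.1, max_delay=2.0,
                       sleep=time.sleep):
    """Run op(); on failure, back off exponentially with jitter, then retry."""
    for attempt in range(max_attempts):
        try:
            return op()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            delay = min(max_delay, base_delay * (2 ** attempt))
            sleep(delay * random.uniform(0.5, 1.0))  # jitter avoids retry storms

# Example: an operation that times out twice, then succeeds.
calls = {"n": 0}
def flaky_write():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("forwarded write timed out")
    return "ok"

print(retry_with_backoff(flaky_write, sleep=lambda _: None))  # ok
```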

Failover Surprises

Three problems I've encountered during Aurora Global Database failover:

Connection pool stale connections. After failover, the old primary's endpoint resolves to the new writer, but existing connections in application connection pools still point to the old writer (now a reader or offline). Connection pool libraries that do not validate connections before use will throw write errors. Configure connection validation (test-on-borrow) and set maximum connection lifetime to 5-10 minutes.

Instance class mismatch after failover. If your secondary region uses smaller instances than the primary (a common cost-saving measure), failover promotes a smaller instance to writer. Under the primary's write load, that instance may be undersized. Match instance classes across regions, or accept that failover degrades write performance.

Application DNS caching. Aurora endpoints use DNS CNAME records. After failover, the CNAME updates, but applications caching DNS beyond TTL will continue sending traffic to the old endpoint. Set DNS TTL to 5-30 seconds in your application and OS resolver configuration. Java applications are particularly prone to aggressive DNS caching via the JVM's default networkaddress.cache.ttl.
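The first two mitigations combine into one borrow-time check: validate the connection and discard anything past its maximum lifetime. A generic sketch of the pattern (not any specific driver's API; `ping` stands in for a `SELECT 1` validation query):

```python
# Test-on-borrow plus max connection lifetime: pools shed connections that
# are dead (pointing at the old writer) or old enough to be suspect.

class PooledConnection:
    def __init__(self, created_at: float, alive: bool = True):
        self.created_at = created_at
        self.alive = alive

    def ping(self) -> bool:
        """Stand-in for running SELECT 1 against the server."""
        return self.alive

def borrow(pool, now, max_lifetime_s=300, connect=None):
    """Return a validated, fresh connection; discard stale or dead ones."""
    while pool:
        conn = pool.pop(0)
        too_old = now - conn.created_at > max_lifetime_s
        if conn.ping() and not too_old:
            return conn
        # dead or past max lifetime: drop it and try the next one
    return connect(now) if connect else None

pool = [
    PooledConnection(created_at=0, alive=False),  # stale, points at old writer
    PooledConnection(created_at=590),             # alive but past max lifetime
    PooledConnection(created_at=800),             # alive and fresh
]
conn = borrow(pool, now=1000, connect=lambda t: PooledConnection(t))
print(conn.created_at)  # 800 — the first two were discarded
```

Most pool libraries (HikariCP, pgbouncer, SQLAlchemy's pool) expose these two knobs directly; the point is to set both, not just one.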

DSQL Conflict Storms

DSQL's optimistic concurrency control shines when transactions touch different data. When multiple transactions contend for the same rows, conflict rates climb, throughput drops, and tail latencies increase. A counter table, a shared sequence, or a "latest version" record will generate conflicts under concurrent writes from both regions. Design your schema to minimize cross-transaction overlap. Use UUIDs instead of sequences. Shard hot records by appending random suffixes (the same pattern used in DynamoDB; see AWS DynamoDB: An Architecture Deep-Dive).
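The write-sharding pattern looks like this in miniature (illustrative; the dict stands in for a table keyed by `name#suffix`):

```python
# Write-shard a hot counter: spread increments across N sub-records so
# concurrent transactions rarely touch the same row; sum them on read.
import random
from collections import defaultdict

NUM_SUFFIXES = 10
counters = defaultdict(int)  # stand-in for a table keyed by "name#suffix"

def increment(name: str):
    """Each writer picks a random sub-record, cutting conflict odds ~10x."""
    suffix = random.randrange(NUM_SUFFIXES)
    counters[f"{name}#{suffix}"] += 1

def read_total(name: str) -> int:
    """Reads aggregate across all sub-records of the logical counter."""
    return sum(counters[f"{name}#{s}"] for s in range(NUM_SUFFIXES))

for _ in range(1000):
    increment("page_views")
print(read_total("page_views"))  # 1000
```

The trade is the usual one: writes get cheaper and conflict-free, reads get slightly more expensive because they fan out across the sub-records.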

Pricing Comparison

Cost varies dramatically across approaches. Here is a representative comparison for a medium-scale deployment (db.r6g.xlarge equivalent, 500 GB storage, moderate I/O):

| Cost Component | Global Database (2 regions) | Global DB + Write Forwarding | DSQL (2 regions + witness) |
|---|---|---|---|
| Primary compute | ~$0.45/hr ($328/mo) | ~$0.45/hr ($328/mo) | Serverless (pay per request) |
| Secondary compute | ~$0.45/hr ($328/mo) | ~$0.45/hr ($328/mo) | Serverless (pay per request) |
| Storage (per region) | $0.10/GB/mo ($50/region) | $0.10/GB/mo ($50/region) | Included in request pricing |
| I/O (Standard) | $0.20/million requests | $0.20/million requests | Included |
| Cross-region replication | Included | Included | Included |
| Estimated monthly total | ~$760 | ~$760 | Varies by request volume |
| Reserved Instance savings | Up to 66% (3-year) | Up to 66% (3-year) | N/A (serverless) |

For steady-state workloads, provisioned Aurora Global Database with reserved instances is the most cost-effective multi-region option. DSQL's serverless pricing favors spiky workloads where you would otherwise over-provision.

Additional cost considerations:

  • I/O-Optimized configuration eliminates per-request I/O charges for $0.225/GB/mo storage. Worth it when I/O costs exceed 25% of your total Aurora spend.
  • Global Database storage is charged in each region. 500 GB across 2 regions costs $100/mo in storage alone.
  • Write forwarding adds no extra AWS charges beyond the normal write costs on the primary. You pay for the cross-region network transfer implicitly through the forwarded query's execution on the primary.

Key Patterns

Start with Global Database active-passive. Most applications do not need multi-region writes. Read-heavy workloads with a single write region and fast failover cover the vast majority of production requirements. Resist the temptation to add write forwarding or DSQL complexity unless your workload genuinely demands it.

Use write forwarding for occasional secondary-region writes only. Treat it as a convenience feature for low-volume administrative writes, not as a multi-region write scaling solution. If more than 10% of your writes originate from secondary regions, evaluate whether those writes can be redesigned or whether you need a different architecture.

Evaluate DSQL for greenfield active/active. If you are building a new application that requires 99.999% availability with active/active writes across two US regions, DSQL delivers the strongest AWS-native solution. Verify that your application can work within DSQL's PostgreSQL subset before committing. The limitations on foreign keys, views, and triggers disqualify many existing applications.

Match instance classes across regions. A failover to an undersized secondary region creates a new problem during the original problem. The cost savings from smaller secondary instances evaporate when failover degrades performance at the worst possible time.

Monitor replication lag religiously. Set CloudWatch alarms on AuroraGlobalDBReplicationLag and AuroraGlobalDBReplicatedWriteIO. Replication lag determines your RPO during unplanned failover and your write latency during write forwarding. If lag exceeds your tolerance, investigate the primary's write volume and the secondary's instance capacity.

Test failover quarterly. Use planned switchover to validate that your application reconnects correctly, connection pools refresh, DNS resolves to the new primary, and monitoring alerts fire as expected. The October 2025 DynamoDB/EC2 outage in US-EAST-1 demonstrated that untested DR plans fail when you need them most.

Let's Build Something!

I help teams ship cloud infrastructure that actually works at scale. Whether you're modernizing a legacy platform, designing a multi-region architecture from scratch, or figuring out how AI fits into your engineering workflow, I've seen your problem before. Let me help.

Currently taking on select consulting engagements through Vantalect.