About the author: I'm Charles Sieg, a cloud architect and platform engineer who builds apps, services, and infrastructure for Fortune 1000 clients through Vantalect. If your organization is rethinking its software strategy in the age of AI-assisted engineering, let's talk.
ElastiCache looks easy. Deploy a managed cache, point your app at the endpoint, enjoy sub-millisecond reads. Then production happens. Engine selection, cluster topology, eviction policy, replication strategy, connection management, failover behavior: every one of these choices determines whether your caching layer holds up or collapses at 3 AM on a Saturday. I've spent years building and running ElastiCache clusters serving millions of requests per second. Some fronted relational databases with multi-terabyte datasets. Others were dead-simple session stores. All of them taught me something. Usually through failure first.
This is an architecture reference. If you're here to understand ElastiCache internals, scaling strategies, failure modes that blindside teams, and how to make sharp decisions about engines, topologies, and the newer serverless offering, you're in the right place.
What ElastiCache Actually Is
ElastiCache is AWS's fully managed in-memory caching service. Provisioning, patching, monitoring, failure detection, automatic failover, backup: AWS handles all of it. You focus on cache key design, eviction strategy, and wiring it into your application. Three engines: Valkey, Redis OSS, and Memcached.
It's changed a lot since 2011. Managed Memcached came first. Redis support landed in 2013, cluster mode enabled in 2016, encryption and RBAC between 2018 and 2020, data tiering on r6gd instances in 2021, ElastiCache Serverless in 2023, Valkey support in 2024. Every addition shifted how you should deploy and operate caching layers on AWS.
The Shift to Valkey
In March 2024, Redis Ltd. changed the Redis license from the permissive BSD 3-Clause to a dual SSPL/RSAL model. Cloud providers couldn't offer Redis as a managed service without a commercial agreement anymore. AWS, Google, Oracle, Ericsson, and the Linux Foundation responded by forking Redis 7.2.4 into Valkey (open-source, BSD license, under the Linux Foundation).
If you were around for the Elasticsearch/OpenSearch fork in 2021, same playbook. For ElastiCache users, the practical result is a transition. AWS now recommends Valkey for all new deployments. Pricing is 33% lower for serverless and 20% lower for node-based configurations versus Redis OSS. Valkey is wire-protocol compatible with Redis, so your existing client libraries, commands, and data structures work without modification. The Valkey project has already shipped its own innovations: I/O multi-threading in Valkey 8.0, performance improvements in 8.1, bloom filters, and vector search.
New project? Valkey. Already running Redis OSS clusters? Plan a migration but don't rush it. The transition is straightforward, and ElastiCache still supports Redis OSS.
Engine Comparison
| Feature | Valkey | Redis OSS | Memcached |
|---|---|---|---|
| License | BSD (Linux Foundation) | SSPL/RSAL (Redis Ltd.) | BSD |
| Data structures | Strings, lists, sets, sorted sets, hashes, streams, HyperLogLog, bitmaps, geospatial | Same as Valkey | Strings only |
| Max value size | 512 MB | 512 MB | 1 MB (default, configurable) |
| Persistence | RDB snapshots, AOF | RDB snapshots, AOF | None |
| Replication | Primary/replica with automatic failover | Primary/replica with automatic failover | None |
| Clustering | Hash slot sharding (16,384 slots) | Hash slot sharding (16,384 slots) | Client-side consistent hashing |
| Pub/Sub | Yes | Yes | No |
| Lua scripting | Yes | Yes | No |
| Transactions | MULTI/EXEC | MULTI/EXEC | CAS (check-and-set) only |
| Multi-threading | I/O multi-threading (Valkey 8.0+) | Single-threaded command execution | Fully multi-threaded |
| Serverless | Yes | Yes | No |
| Data tiering | Yes (r6gd nodes) | Yes (r6gd nodes) | No |
| ElastiCache pricing | 20-33% lower than Redis OSS | Baseline | Comparable to Redis OSS |
Valkey/Redis vs. Memcached is a quick decision. Need data structures, persistence, replication, pub/sub, scripting, or serverless? Valkey. Memcached still has one edge: its native multi-threaded architecture saturates all available CPU cores without cluster mode. That said, I pick Valkey for new deployments unless I've got a specific, benchmarked reason to do otherwise.
Redis (Valkey) vs. Memcached: A Deep Comparison
These engines differ at a fundamental level. You need to understand those differences to pick the right one and to know what you're signing up for operationally.
Threading and Execution Model
Valkey/Redis runs a single-threaded event loop for command execution. Every command (GET, SET, ZADD, Lua script) runs sequentially on one CPU core. No lock contention. Deterministic latency regardless of concurrency. Recent versions moved I/O operations (network read/write, disk persistence, memory deallocation) to background threads, and Valkey 8.0 pushed this further with I/O multi-threading. Network I/O now parallelizes while command execution stays single-threaded. Throughput gains are significant.
Memcached goes the other way entirely. A pool of worker threads handles connections and processes commands concurrently. Sixteen-core node? All 16 cores execute commands simultaneously. For simple GET/SET workloads on big instances, Memcached wins on raw throughput. The catch: that assumes you're CPU-bound rather than memory-bound or network-bound. In practice, that assumption holds less often than people think.
Data Structures and Complexity
Valkey/Redis has seven primary data structures with purpose-built commands and O(1) or O(log N) operations for common access patterns. Sorted sets give you leaderboards and range queries. Streams give you durable pub/sub with consumer groups. HyperLogLog does cardinality estimation in 12 KB. Bitmaps let you store per-user feature flags in megabytes instead of gigabytes.
Memcached supports strings. That's it. Serialize your data, store it, deserialize on read. No atomic operations beyond CAS (compare-and-swap), no range queries, no aggregations, no pub/sub.
Persistence and Durability
Valkey/Redis offers two persistence mechanisms: RDB snapshots (point-in-time binary dumps) and AOF (append-only file logging). Together they let you recover from node restarts with minimal data loss. ElastiCache automates snapshot management with configurable retention up to 35 days.
Memcached has no persistence. None. Node restart means total data loss. Node failure permanently destroys every key on that node, and there's no recovery mechanism.
Replication and High Availability
Valkey/Redis supports primary/replica replication with automatic failover. Each shard can hold up to five read replicas distributed across Availability Zones. Primary fails? ElastiCache promotes a replica in 30-60 seconds.
Memcached has no replication. Each key lives on exactly one node. That node fails, every key on it vanishes instantly, and the resulting cache miss storm hammers your database.
Lua Scripting
Valkey/Redis supports server-side Lua scripting that executes atomically. Read a value, compute something, write a result: one atomic operation, zero race conditions. I've used this for rate limiting with sliding windows, conditional updates, and multi-step transactions that would otherwise need distributed locks.
Memcached? No scripting capability at all.
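To show what that gap buys you, here's the sliding-window rate-limit logic I'd normally push into a server-side Lua script, sketched in plain Python against a dict standing in for a sorted set. The class and numbers are illustrative; in production the same three steps would run as one atomic EVAL so no race conditions are possible.

```python
import time

class SlidingWindowLimiter:
    """Sketch of logic that, in production, would ship as an atomic Lua script:
    drop entries older than the window, count what's left, record the new hit."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.events = {}  # key -> list of timestamps (stand-in for a Redis sorted set)

    def allow(self, key, now=None):
        now = time.time() if now is None else now
        window_start = now - self.window
        # Drop timestamps outside the window (ZREMRANGEBYSCORE equivalent)
        kept = [t for t in self.events.get(key, []) if t > window_start]
        if len(kept) >= self.limit:   # ZCARD equivalent: window is full
            self.events[key] = kept
            return False
        kept.append(now)              # ZADD equivalent: record this request
        self.events[key] = kept
        return True

limiter = SlidingWindowLimiter(limit=3, window_seconds=10)
results = [limiter.allow("client-1", now=100 + i) for i in range(4)]
# results == [True, True, True, False]: the fourth request exceeds the limit
```

Run as Lua on the server, the read-count-write sequence executes atomically even under concurrent clients, which is exactly what Memcached can't give you.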
Architecture Internals
You need to understand how ElastiCache clusters are built before anything else. Topology, sizing, and resilience decisions all flow from this.
Cluster Components
| Component | Purpose | Key Considerations |
|---|---|---|
| Nodes | Individual cache instances running on EC2 | Choose node type based on memory, CPU, and network requirements |
| Shards (node groups) | A primary node + 0-5 replicas | Each shard holds a subset of the keyspace (cluster mode enabled) or the full keyspace (cluster mode disabled) |
| Replication group | Collection of shards forming a logical cluster | The unit of failover, backup, and scaling operations |
| Parameter group | Engine configuration settings | Controls eviction policy, memory limits, timeouts, slow log thresholds, and feature flags |
| Subnet group | VPC subnet placement | Determines AZ distribution and network isolation for cache nodes |
| Security group | Network access control | Controls which resources can connect to the cache on port 6379 (Valkey/Redis) or 11211 (Memcached) |
Node Types
ElastiCache node types follow the EC2 naming convention with a cache. prefix. Getting the family right is one of the highest-leverage decisions you'll make on cost and performance.
| Family | Processor | Optimized For | Production Use Case |
|---|---|---|---|
| cache.t3 / cache.t4g | Intel / Graviton2 | Burstable CPU | Development, testing, small workloads with variable traffic |
| cache.m6g / cache.m7g | Graviton2 / Graviton3 | Balanced compute and memory | General-purpose caching where CPU and memory needs are balanced |
| cache.r6g / cache.r7g | Graviton2 / Graviton3 | Memory-optimized | Large datasets, high memory-to-CPU ratio workloads |
| cache.r6gd | Graviton2 + NVMe SSD | Data tiering (memory + SSD) | Very large datasets with hot/cold access patterns |
| cache.c7gn | Graviton3 + enhanced networking | Network-optimized | Extremely high throughput requirements |
R7g nodes deliver up to 28% more throughput and 21% better P99 latency over R6g, plus 25% higher networking bandwidth. Graviton-based instances consistently run 20-30% better price-performance than Intel equivalents.
Sizing advice: start with cache.r6g.large or cache.r7g.large for most production workloads. The r-family gives you the most memory per dollar, and memory is almost always the binding constraint. Go with m-family only when your workload is genuinely CPU-bound (heavy Lua scripting, complex sorted set operations). T-family? Dev and test only. Burstable instances will surprise you with latency spikes once CPU credits deplete. I learned that one the hard way during a load test that looked great for two hours, then fell off a cliff.
Parameter Groups
Parameter groups control engine-level behavior. Defaults work fine for development. Production needs tuning.
| Parameter | Default | Production Recommendation | Rationale |
|---|---|---|---|
| maxmemory-policy | volatile-lru | Depends on use case (see Eviction Policies) | Controls behavior when memory is exhausted |
| timeout | 0 (disabled) | 300-600 seconds | Prevents connection leaks from crashed clients |
| tcp-keepalive | 300 | 60-120 seconds | Detects dead connections faster |
| activedefrag | no | yes | Reduces memory fragmentation on long-running clusters |
| lazyfree-lazy-eviction | no | yes | Moves eviction memory reclamation to background thread |
| lazyfree-lazy-expire | no | yes | Moves TTL expiration memory reclamation to background thread |
| lazyfree-lazy-server-del | no | yes | Moves server-initiated DEL to background thread |
| slowlog-log-slower-than | 10000 (10ms) | 5000 (5ms) | Catch slow operations earlier |
| notify-keyspace-events | "" (disabled) | Enable only if needed | Keyspace notifications add CPU and memory overhead |
Subnet Groups and Network Placement
ElastiCache clusters run inside your VPC. No public endpoint, period. A subnet group defines which subnets (and therefore which AZs) receive cache nodes.
For production clusters with Multi-AZ enabled, include subnets in at least two AZs. I prefer three. ElastiCache distributes primary and replica nodes across available AZs automatically.
Your applications must be in the same VPC, a peered VPC, or connected via Transit Gateway to reach ElastiCache. No internet-facing endpoint. No NAT Gateway path. No VPC endpoint like DynamoDB or S3 have. Plan your network topology accordingly.
Cluster Mode: Disabled vs. Enabled
One of the most consequential architectural decisions you'll make with ElastiCache for Valkey/Redis. Also one of the hardest to reverse. For years there was no in-place migration from cluster mode disabled to enabled: you created a new cluster and migrated data. Recent engine versions (Redis OSS 6.x+ and Valkey) support an online conversion, but it's a multi-step operation that also forces every client onto a cluster-aware library. I've watched teams learn this the painful way.
Cluster Mode Disabled
Cluster mode disabled means a single shard: one primary node and up to five read replicas. Every node holds the entire dataset.
| Characteristic | Detail |
|---|---|
| Shards | 1 |
| Max replicas | 5 |
| Max memory | Limited to a single node (up to ~419 GiB on cache.r7g.16xlarge) |
| Read scaling | Add read replicas (up to 5), reader endpoint load-balances across them |
| Write scaling | Vertical only: upgrade to a larger node type |
| Multi-key operations | Unrestricted: MGET, MSET, transactions, Lua scripts work across all keys |
| Endpoints | Primary endpoint (writes) + Reader endpoint (reads) |
| Client requirements | Any Redis client; no cluster awareness needed |
Simpler to operate. Makes sense when your dataset fits comfortably in a single node with growth headroom and your write throughput stays within what one primary can handle.
Cluster Mode Enabled
Cluster mode enabled partitions the keyspace across multiple shards using 16,384 hash slots. Each key maps to a slot via CRC16(key) mod 16384. Each shard owns a contiguous range of those slots.
| Characteristic | Detail |
|---|---|
| Shards | 1 to 500 |
| Max replicas per shard | 5 |
| Max nodes per cluster | 500 (e.g., 83 shards x 6 nodes, or 500 shards x 1 primary each) |
| Max memory | Sum of all shard memory (petabyte-scale with data tiering) |
| Read scaling | Add replicas within each shard |
| Write scaling | Add more shards (horizontal scaling via online resharding) |
| Multi-key operations | Only for keys in the same hash slot; use hash tags {tag} to co-locate |
| Endpoints | Configuration endpoint (cluster-aware clients required) |
| Client requirements | Cluster-aware client (handles MOVED/ASK redirections) |
Hash Slot Distribution and Hash Tags
ElastiCache distributes the 16,384 hash slots across shards. Three shards? Each shard typically owns roughly 5,461 slots. Rebalancing tries to distribute slots evenly, but keys aren't uniform in size. Uniform slot distribution doesn't guarantee uniform memory distribution.
Hash tags are critical for multi-key operations in cluster mode. If you need MGET, transactions, or Lua scripts across related keys, those keys must hash to the same slot. Force this by including a common substring in curly braces:
- user:{12345}:profile and user:{12345}:preferences both hash based on 12345
- They will always land on the same shard, enabling atomic multi-key operations
A mistake I see over and over: teams migrate to cluster mode enabled without planning hash tags. Applications that use MGET across unrelated keys, Lua scripts touching keys on different shards, transactions spanning arbitrary keys. They all break with CROSSSLOT errors. Silent, sudden, and completely avoidable with planning.
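The slot math itself is simple enough to sketch. Here's a minimal Python version of the keyslot algorithm cluster-aware clients implement: CRC16 (XMODEM variant) of the key mod 16384, hashing only the hash-tag substring when a non-empty one is present. Function names are mine.

```python
def crc16_xmodem(data: bytes) -> int:
    """CRC16/XMODEM (poly 0x1021, init 0), the checksum used for key slots."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = (crc << 1) ^ 0x1021 if crc & 0x8000 else crc << 1
            crc &= 0xFFFF
    return crc

def key_hash_slot(key: str) -> int:
    """Map a key to one of the 16,384 hash slots, honoring {hash tags}."""
    raw = key.encode()
    start = raw.find(b"{")
    if start != -1:
        end = raw.find(b"}", start + 1)
        if end > start + 1:            # non-empty tag: hash only the tag
            raw = raw[start + 1:end]
    return crc16_xmodem(raw) % 16384

# Keys sharing the tag {12345} land in the same slot, so MGET works across them
same = key_hash_slot("user:{12345}:profile") == key_hash_slot("user:{12345}:preferences")
```

If two keys don't share a slot under this function, any multi-key command touching both will fail with CROSSSLOT in cluster mode.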
Resharding Operations
Cluster mode enabled lets you add or remove shards (online resharding) and rebalance hash slots while the cluster serves traffic. Resharding has real operational costs, though.
| Aspect | Impact |
|---|---|
| Availability | Cluster remains available but migrating keys may experience elevated latency |
| Duration | Proportional to data volume being migrated; large shards take longer |
| Configuration changes | Cannot process other configuration changes during resharding |
| Multi-key operations | Commands spanning migrating and non-migrating slots may fail during migration |
| Memory overhead | Source and destination shards temporarily hold copies of migrating data |
Here's what I tell every team: if there's any chance your dataset will outgrow a single node, start with cluster mode enabled from day one. Even a single-shard cluster mode enabled configuration preserves the option to add shards later. Save yourself the painful migration event down the road.
```mermaid
flowchart TB
    subgraph CMD["Cluster Mode Disabled"]
        direction TB
        P1["Primary Node<br/>Full Dataset"] --> R1[Replica 1]
        P1 --> R2[Replica 2]
        CE1[Primary Endpoint] --> P1
        CE2[Reader Endpoint] --> R1
        CE2 --> R2
    end
    subgraph CME["Cluster Mode Enabled"]
        direction TB
        subgraph S1["Shard 1<br/>Slots 0-5460"]
            P3[Primary] --> R3[Replica]
        end
        subgraph S2["Shard 2<br/>Slots 5461-10922"]
            P4[Primary] --> R4[Replica]
        end
        subgraph S3["Shard 3<br/>Slots 10923-16383"]
            P5[Primary] --> R5[Replica]
        end
        CE3["Configuration<br/>Endpoint"] --> S1
        CE3 --> S2
        CE3 --> S3
    end
```

Replication and High Availability
Primary/Replica Architecture
Each shard in a Valkey/Redis replication group has one primary node and zero to five replicas. The primary handles all writes and propagates changes to replicas asynchronously. Replicas serve read traffic and stand ready for promotion if the primary fails.
Replication is async by default. Under normal conditions, lag is sub-millisecond. But there's always a window where a replica hasn't received the latest writes yet. After a failover, a small number of recent writes are gone. If your application can't tolerate any data loss for specific data, that data belongs in a different persistence layer. Full stop.
Multi-AZ with Automatic Failover
Enable Multi-AZ for every production cluster. No qualifications on this one. ElastiCache distributes primary and replica nodes across Availability Zones and automatically promotes a replica when the primary goes down.
The failover sequence looks like this:
- ElastiCache detects the primary node is unhealthy (typically within 30-60 seconds)
- It selects the replica with the least replication lag
- The selected replica is promoted to primary
- DNS is updated to point the primary endpoint to the new primary
- Other replicas begin replicating from the new primary
- A replacement replica is provisioned in the original AZ
Total failover time runs 30-60 seconds in most cases. It can stretch longer under heavy load or with large datasets on the promoted replica. During failover, writes to the affected shard fail. Reads from replicas keep working.
```mermaid
sequenceDiagram
    participant App as Application
    participant EP as Primary Endpoint
    participant P as Primary Node AZ-1
    participant R as Replica Node AZ-2
    Note over P: Primary fails
    App->>EP: Write request
    EP-->>App: Connection error
    rect rgb(255,230,230)
        Note over EP,R: Failover (30-60 seconds)
        R->>R: Promoted to Primary
        EP->>EP: DNS updated to AZ-2
    end
    App->>EP: Retry write
    EP->>R: Route to new Primary
    R-->>App: Success
    Note over P: Replacement replica<br/>provisioned in AZ-1
```
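Since writes to the affected shard fail until promotion completes, clients need retry logic around writes. Here's a hedged sketch of exponential backoff with jitter; flaky_write is a stand-in that simulates a failover rather than calling a real cluster.

```python
import random
import time

def retry_with_backoff(operation, max_attempts=5, base_delay=0.1, max_delay=5.0,
                       retryable=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Retry a callable with exponential backoff plus jitter.

    During a 30-60 second failover window, early attempts fail and later ones
    succeed once DNS points at the promoted replica.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except retryable:
            if attempt == max_attempts - 1:
                raise                                  # out of attempts: surface the error
            delay = min(max_delay, base_delay * (2 ** attempt))
            sleep(delay * random.uniform(0.5, 1.0))    # jitter avoids thundering herd

# Simulate a failover: the first two attempts hit the dead primary
attempts = {"n": 0}
def flaky_write():
    attempts["n"] += 1
    if attempts["n"] <= 2:
        raise ConnectionError("primary unreachable")
    return "OK"

result = retry_with_backoff(flaky_write, sleep=lambda _: None)  # no real sleeping in the demo
# result == "OK" on the third attempt
```

Most mature Redis clients ship equivalent retry policies; the point is to configure them deliberately with the 30-60 second failover window in mind.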
Replica Lag Monitoring
Replica lag is the single most important replication health metric. It measures the delay between a write hitting the primary and that write getting applied on a replica. Watch the ReplicationLag CloudWatch metric for every replica node. Don't skip this.
| Lag Range | Interpretation | Action |
|---|---|---|
| < 1 second | Normal operation | No action needed |
| 1-5 seconds | Elevated; possible network congestion or heavy write load | Investigate primary write throughput and network |
| 5-10 seconds | Warning; failover would lose multiple seconds of writes | Scale up node type or reduce write volume |
| > 10 seconds | Critical; significant data loss risk on failover | Immediate investigation required |
Persistent high replica lag means the primary's write throughput exceeds what the replica can keep up with. Common culprits: sustained write bursts, large key operations, Lua scripts generating tons of writes, slow cross-AZ network.
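Those thresholds translate directly into an alerting rule. A trivial sketch; the function name and cutoffs mirror the table above, and you should tune them for your own workload.

```python
def classify_replication_lag(lag_seconds):
    """Map a ReplicationLag reading (seconds) to a severity per the table above."""
    if lag_seconds < 1:
        return "ok"          # normal operation
    if lag_seconds < 5:
        return "elevated"    # investigate write throughput and network
    if lag_seconds < 10:
        return "warning"     # failover would lose multiple seconds of writes
    return "critical"        # significant data loss risk on failover

severities = [classify_replication_lag(s) for s in (0.2, 3, 7, 12)]
# severities == ["ok", "elevated", "warning", "critical"]
```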
Read Replicas for Read Scaling
Read replicas do two things. They're failover targets for high availability, and they scale read throughput. The reader endpoint load-balances read connections across all replicas in a shard.
For read-heavy workloads (80%+ reads), adding replicas cuts EngineCPUUtilization on the primary by offloading read traffic. Each replica handles reads independently on its own engine thread, so read throughput scales linearly with replica count.
Replicas cost money, though. Each one is a full node on your bill. If your workload is write-heavy and replicas sit underutilized for reads, you're paying for capacity that only provides HA value. One replica per shard is enough for failover. Add more only when measured read throughput actually demands it.
ElastiCache Serverless
ElastiCache Serverless launched in late 2023 with a fundamentally different operational model. No provisioning nodes, no selecting cluster topologies, no managing scaling policies. You create a serverless cache. ElastiCache handles capacity management.
Architecture
When you create a serverless cache, ElastiCache provisions multi-AZ, replicated infrastructure behind the scenes. You get a single endpoint. Compute and storage scale independently based on workload. No manual intervention.
You still specify VPC and subnet configurations for network isolation. Everything else (node management, AZ distribution, failover, patching, capacity planning) happens transparently.
| Aspect | Serverless | Node-Based (Provisioned) |
|---|---|---|
| Capacity planning | Automatic | Manual: you choose node types and count |
| Scaling | Automatic, scales in minutes | Manual: add/remove nodes or shards |
| High availability | Built-in Multi-AZ, automatic | Must enable Multi-AZ and configure replicas |
| Patching | Automatic, zero downtime | Requires maintenance windows |
| Pricing model | Pay per ECPU + storage (GB-hours) | Pay per node-hour (on-demand or reserved) |
| Configuration | Limited tuning knobs | Full parameter group control |
| Max throughput | 5 million requests/second per cache | Depends on cluster sizing |
| Engine support | Valkey, Redis OSS | Valkey, Redis OSS, Memcached |
| Data tiering | Not available | Available on r6gd nodes |
| Setup time | Under 1 minute | 5-15 minutes |
ECPU Pricing Model
Serverless pricing has two dimensions.
ElastiCache Processing Units (ECPUs): Each kilobyte of data transferred (reads and writes) costs 1 ECPU. A GET returning a 3 KB value? 3 ECPUs. Simple commands on small values cost 1 ECPU.
Data storage: Charged per GB-hour based on the hourly average of data stored. Valkey serverless has a minimum data storage of 100 MB (90% lower than the 1 GB minimum for Redis OSS), which sets the floor for storage costs.
| Dimension | Approximate Rate (us-east-1) |
|---|---|
| ECPUs (Valkey) | $0.0034 per million ECPUs |
| Data storage (Valkey) | ~$0.084 per GB-hour (vs. ~$0.125 for Redis OSS) |
| Minimum storage (Valkey) | 100 MB |
| Minimum storage (Redis OSS) | 1 GB |
An idle Valkey serverless cache runs about $6/month.
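A back-of-envelope estimator makes the ECPU model concrete. The rates here are assumptions rounded from the table (real bills also meter vCPU time for expensive commands), so treat the output as directional, not a quote.

```python
import math

def ecpus_for_request(payload_bytes):
    """Each KB transferred costs 1 ECPU, with a 1-ECPU minimum per request."""
    return max(1, math.ceil(payload_bytes / 1024))

def monthly_cost(requests_per_month, avg_payload_bytes, avg_stored_gb,
                 ecpu_rate_per_million=0.0034,     # assumed Valkey ECPU rate, us-east-1
                 storage_rate_per_gb_hour=0.084,   # assumed Valkey storage rate
                 min_storage_gb=0.1,               # 100 MB Valkey minimum
                 hours_per_month=730):
    ecpu_cost = (requests_per_month * ecpus_for_request(avg_payload_bytes)
                 / 1e6 * ecpu_rate_per_million)
    storage_cost = max(avg_stored_gb, min_storage_gb) * storage_rate_per_gb_hour * hours_per_month
    return round(ecpu_cost + storage_cost, 2)

# An idle cache: zero requests, storage floored at the 100 MB minimum
idle = monthly_cost(0, 0, 0)   # lands around $6/month
```

Run the same function against your measured request rate and payload sizes before deciding between serverless and provisioned.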
When to Use Serverless vs. Provisioned
Choose serverless when:
- Your workload is unpredictable, spiky, or highly variable
- You want to minimize operational overhead and skip capacity planning
- You're early-stage and don't know steady-state traffic patterns yet
- Your team doesn't have deep ElastiCache operational experience
- Time to production matters more than cost optimization
Choose node-based (provisioned) when:
- Your workload is predictable and reserved instance pricing (up to 48% savings) applies
- You need fine-grained control over engine parameters
- You need data tiering (r6gd instances with SSD)
- Cost optimization at scale is the priority
- You require Memcached
- You need specific node placement across AZs
My take: start with serverless for new workloads. Monitor ECPU consumption and storage for a few weeks. If monthly spend passes $800-1,200, run the numbers on provisioned clusters with 1-year reserved instances. The crossover point is real, and I've hit it on multiple projects.
Data Tiering
Data tiering on cache.r6gd node types extends ElastiCache storage beyond DRAM by adding locally attached NVMe SSDs. For large datasets, it's one of the most impactful cost optimizations available.
Memory + SSD Architecture
The r6gd data tiering architecture is straightforward:
- Keys always stay in DRAM. Only values are candidates for tiering to SSD.
- Hot data stays in DRAM. ElastiCache tracks every item's last access time and keeps frequently accessed data in memory.
- Cold data migrates to SSD. When DRAM fills up, an LRU algorithm moves infrequently accessed values to local NVMe SSD storage.
- SSD reads are transparent. When an application requests a value sitting on SSD, ElastiCache reads it, moves it back to DRAM asynchronously, and returns it to the client.
Added latency for SSD-resident data averages about 300 microseconds (assuming 500-byte string values). That's higher than sub-100-microsecond DRAM latency, sure. Still acceptable for most applications by a wide margin.
Capacity and Cost
| Node Type | DRAM | SSD | Total Capacity | Approx. On-Demand (us-east-1) |
|---|---|---|---|---|
| cache.r6gd.xlarge | 26.3 GiB | 118 GiB | ~144 GiB | ~$0.48/hr |
| cache.r6gd.2xlarge | 52.8 GiB | 237 GiB | ~290 GiB | ~$0.96/hr |
| cache.r6gd.4xlarge | 105.8 GiB | 474 GiB | ~580 GiB | ~$1.92/hr |
| cache.r6gd.8xlarge | 211.1 GiB | 949 GiB | ~1,160 GiB | ~$3.84/hr |
| cache.r6gd.16xlarge | 419.7 GiB | 1,897 GiB | ~2,317 GiB | ~$7.67/hr |
R6gd nodes offer 4.8x more total capacity (memory + SSD) than equivalent r6g nodes, with over 60% cost savings at maximum utilization. Build a 500-node cluster mode enabled configuration from the largest r6gd nodes and you get roughly 1 petabyte of total capacity (about 500 TB of unique data with one read replica per shard).
Security on Data Tiering Nodes
All Graviton2-based nodes include always-on 256-bit encrypted DRAM. Items stored on NVMe SSDs get encrypted by default using XTS-AES-256 block cipher in a hardware module on the node. This encryption is active even if you didn't explicitly enable encryption at rest. It's a hardware-level feature of the r6gd platform.
When Data Tiering Makes Sense
Data tiering works well when your dataset is large (hundreds of GiB to TBs), only 10-20% of the data gets regular access, and you can tolerate the extra ~300 microsecond latency on cold reads.
Skip it when every read needs consistent sub-millisecond latency, when your access pattern is uniformly random with no hot/cold separation, or when your dataset fits in memory on standard nodes without breaking a sweat.
Backup and Restore
RDB Snapshots
ElastiCache for Valkey/Redis supports point-in-time snapshots using the RDB (Redis Database) format. The engine forks the process: child writes the entire dataset to a binary file while the parent keeps serving requests.
| Feature | Detail |
|---|---|
| Automatic backups | Configurable retention period (0-35 days), daily during a specified backup window |
| Manual snapshots | On-demand, retained until explicitly deleted (no retention limit) |
| Snapshot storage | S3 (managed by ElastiCache, not visible in your S3 console) |
| Export to S3 | Manual snapshots can be exported to your own S3 bucket |
| Restore | Restore a snapshot to create a new cluster (cannot restore into an existing cluster) |
| Cluster mode enabled | Snapshots are taken per-shard in parallel |
| Cross-region | Export to S3 in one region, restore in another for DR or migration |
Memory overhead during snapshots: The fork uses copy-on-write semantics. Under heavy write load during a snapshot, the kernel duplicates modified memory pages, and memory usage can approach 2x the dataset size in the worst case. Run a node at 70% memory utilization, take a snapshot during a write-heavy period, and you risk hitting the memory ceiling. Evictions or OOM follow.
Keep baseline memory utilization below 50-60% on nodes taking automatic backups. Schedule backup windows during off-peak hours when write volume is lowest. I've seen teams learn this lesson only after a snapshot triggered evictions that cascaded into application errors. Not a fun page to get.
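You can turn that guidance into a pre-flight check. A sketch under the worst-case assumption that copy-on-write doubles the dataset's footprint during a write-heavy snapshot; the 5% engine reserve is my own fudge factor, not an AWS number.

```python
def snapshot_headroom_ok(node_memory_gib, dataset_gib, cow_factor=2.0, reserve_fraction=0.05):
    """Worst case, the fork's copy-on-write pages approach cow_factor x dataset size.
    Returns True if the node can absorb that plus a small engine reserve."""
    worst_case = dataset_gib * cow_factor
    usable = node_memory_gib * (1 - reserve_fraction)
    return worst_case <= usable

# A ~26 GiB node at 70% utilization risks OOM during a write-heavy snapshot...
risky = snapshot_headroom_ok(26.32, 26.32 * 0.70)   # False: ~36.8 GiB worst case
# ...but the same node at ~45% utilization has headroom
safe = snapshot_headroom_ok(26.32, 26.32 * 0.45)    # True
```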
AOF Persistence
AOF (Append-Only File) persistence logs every write operation to disk. On restart, the engine replays the AOF to reconstruct the dataset. With appendfsync everysec, you lose at most 1 second of writes.
The trade-offs are real, though. AOF increases write latency (each write persists to disk), uses additional storage, and replaying a large AOF during recovery takes time. One ElastiCache-specific caveat: AWS doesn't support AOF on Redis OSS 2.8.22 and later, steering you to Multi-AZ replication for durability instead, so treat AOF as a self-managed-Redis option. For pure caching workloads, RDB snapshots alone provide sufficient durability. AOF-style durability matters most when the cache stores data that's expensive to reconstruct: session state, computed aggregations, rate limiting counters. Losing even a few seconds of that data hurts.
Snapshot Export and Cross-Region Restore
ElastiCache lets you export manual snapshots to your own S3 bucket, which unlocks some useful patterns:
- Cross-region disaster recovery: Snapshot in one region, replicate the S3 object to another, restore there
- Environment seeding: Snapshot production, restore into staging with real data
- Migration: Snapshot from an old cluster, restore into a new one with different node types or cluster mode
- Long-term archival: Export snapshots to S3 for retention beyond the 35-day automatic backup limit
Security
Encryption In-Transit (TLS)
Enabling in-transit encryption activates TLS between clients and cache nodes, and between primary and replica nodes within the cluster. Sounds simple. A few things to know.
| Aspect | Detail |
|---|---|
| Protocol | TLS 1.2 / TLS 1.3 |
| Performance impact | ~25% throughput reduction on smaller node types; less on larger Graviton nodes with hardware crypto |
| Client requirement | Client must connect using TLS (redis:// becomes rediss://) |
| Configuration | Creation-time only on older engine versions; Redis OSS 7.x and Valkey support enabling TLS on an existing cluster via an online migration |
TLS overhead is real. On smaller node types (cache.t3, cache.m6g.medium), TLS handshakes and encryption eat noticeable CPU. Larger Graviton-based nodes have hardware-accelerated cryptographic engines that absorb most of the cost. Benchmark with TLS enabled before you finalize your node type. I've seen teams pick a node size based on non-TLS benchmarks, enable TLS in production, and immediately hit CPU saturation. Unpleasant surprise.
Encryption At-Rest
At-rest encryption uses AWS KMS to encrypt data on disk: RDB snapshots, AOF files, swap files. You pick the default AWS-managed key or a customer-managed CMK. Performance impact is negligible since hardware handles the encryption. Enable it for every production cluster. No reason not to.
Authentication Methods
| Method | Engine Versions | Mechanism | Best For |
|---|---|---|---|
| Redis/Valkey AUTH | All | Static token (16-128 characters) sent with every connection | Simple setups, legacy applications |
| RBAC (Role-Based Access Control) | Redis OSS 6.x+, Valkey 7.2+ | User groups with per-command and per-key access control | Multi-tenant clusters, principle of least privilege |
| IAM authentication | Redis OSS 7.0+, Valkey 7.2+ | Short-lived IAM tokens (15-minute validity) | AWS-native applications, centralized access management |
IAM authentication is what I recommend for any new deployment on Valkey 7.2+ or Redis OSS 7.0+. No AUTH tokens to manage or rotate. It plugs into your existing IAM policies and roles, and CloudTrail gives you audit trails. One catch: the 15-minute token validity means your client library needs automatic token refresh. Most modern Redis clients handle this natively.
AUTH token rotation: If you're using AUTH tokens, ElastiCache supports two rotation strategies. ROTATE adds a new token while keeping the old one valid, so you can do rolling updates across application instances without connection failures. SET immediately replaces the old token. Always use ROTATE in production. I once watched a team use SET during a token rotation. Instantly disconnected every application instance. The incident call was... educational.
VPC Placement and Security Groups
ElastiCache clusters must reside in a VPC. Keep security group rules tight:
- Allow inbound traffic on port 6379 (Valkey/Redis) or 11211 (Memcached) only from security groups of application instances that need cache access
- Don't open cache ports to 0.0.0.0/0 or broad CIDR ranges
- Use separate security groups for different environments even if they share a VPC
- Remember: encryption at rest must be enabled at cluster creation, and on older engine versions in-transit encryption and authentication were creation-only as well. Decide your security posture upfront. This one bites people.
Connection Management
With ElastiCache, connection management causes more production incidents than people expect. Getting it right means understanding the engine's connection model and your application's connection lifecycle.
Connection Limits Per Node Type
Each ElastiCache node has a maximum simultaneous connection count controlled by maxclients. Default is 65,000 for most node types, but the effective limit depends on available memory since each connection consumes memory for input/output buffers.
| Node Family | Default maxclients | Practical Limit | Notes |
|---|---|---|---|
| cache.t3.micro | 65,000 | Hundreds | Memory severely limits effective connections |
| cache.t3.medium | 65,000 | Low thousands | Reasonable for dev/test |
| cache.m6g.large | 65,000 | Thousands | Comfortable for moderate production workloads |
| cache.r6g.xlarge | 65,000 | Tens of thousands | Per-connection memory overhead becomes negligible |
| cache.r6g.4xlarge+ | 65,000 | Tens of thousands | Rarely connection-limited at this tier |
Connection Pooling Strategies
Every application instance needs a connection pool. Creating connections per operation is wasteful, and the math on this is unforgiving. TCP handshake alone costs 1-3 ms. Add TLS and it's 3-5 ms. A Redis GET takes 0.1-0.3 ms. That's connection overhead running 10-50x the cost of the actual command.
Pool sizing guidance:
| Factor | Recommendation |
|---|---|
| Connections per application instance | Start with 10-20, tune based on throughput needs |
| Total connections across fleet | Monitor CurrConnections; stay below 80% of maxclients |
| Idle timeout (client-side) | Set slightly below server-side timeout parameter |
| Connection validation | PING on borrow or reconnect-on-error |
| Maximum wait time | 1-2 seconds; fail fast rather than queue indefinitely |
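To make the fail-fast guidance concrete, here's a minimal bounded-pool sketch in Python. `connect` and `validate` are hypothetical callables (validate would send PING); in a real application you'd use your client library's built-in pool, such as redis-py's `ConnectionPool`, rather than rolling your own:

```python
import queue

class CachePool:
    """Minimal bounded connection pool sketch: fail fast instead of queueing forever.

    `connect` and `validate` are hypothetical callables standing in for real
    client-library machinery; `validate` models PING-on-borrow."""

    def __init__(self, connect, validate, max_size=20, max_wait=2.0):
        self._connect = connect
        self._validate = validate
        self._max_wait = max_wait
        self._idle = queue.LifoQueue(maxsize=max_size)
        for _ in range(max_size):
            self._idle.put(None)           # empty slots; connections created lazily

    def borrow(self):
        try:
            conn = self._idle.get(timeout=self._max_wait)
        except queue.Empty:
            raise TimeoutError("pool exhausted; failing fast after max_wait")
        if conn is None or not self._validate(conn):
            conn = self._connect()         # create, or replace a broken connection
        return conn

    def give_back(self, conn):
        self._idle.put(conn)
```

The 1-2 second `max_wait` is the important part: when the pool is exhausted, you want an immediate, visible error rather than requests silently queueing behind a saturated cache.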
Persistent Connections and DNS Caching
DNS caching causes the most production incidents during failovers. Here's how it plays out: failover happens, ElastiCache updates the primary endpoint's DNS record (5-second TTL) to point to the promoted replica. Your application caches DNS beyond that TTL, so it keeps sending writes to the old, failed node. Writes silently fail. Or worse, they hang.
Critical mitigations:
- JVM applications: Set `networkaddress.cache.ttl=5` in `java.security` or via `java.security.Security.setProperty()`. The JVM default caches DNS forever. Yes, forever. I'm not exaggerating.
- Linux systems: Configure nscd with low TTL or disable it entirely.
- Node.js: Default DNS cache honors TTL, but verify your HTTP/connection library does too.
- Python: The `socket` module doesn't cache DNS by default, but connection pools hold stale connections.
- All platforms: Test failover regularly using the `TestFailover` API. Do this before your first real failure. Not during it.
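On the Python side, staying current is a matter of re-resolving the endpoint at reconnect time rather than holding onto an IP. A small standard-library sketch:

```python
import socket

def resolve_fresh(host, port):
    """Re-resolve the endpoint on every reconnect instead of trusting a held IP.

    Python's socket module does not cache DNS, so calling getaddrinfo at
    reconnect time picks up ElastiCache's updated record (5-second TTL)."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    return [info[4][0] for info in infos]   # candidate IPs, freshest DNS view
```

Call this in your reconnect path (most Redis clients do the equivalent internally); the stale-connection risk lives in pools that pin an address at startup, not in the resolver itself.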
Client Library Selection
For Valkey/Redis, pick a cluster-aware client library. Here's what matters:
| Capability | Why It Matters |
|---|---|
| Cluster mode (MOVED/ASK redirections) | Required for cluster mode enabled topologies |
| Automatic reconnection and retry | Handles transient failures and failovers gracefully |
| Connection pooling | Eliminates connection churn overhead |
| TLS support | Required when encryption in-transit is enabled |
| IAM token refresh | Required for IAM authentication (15-minute token expiry) |
| Pipelining | Batches commands for throughput optimization |
Libraries I've had good results with: Lettuce (Java; non-blocking, cluster-aware, connection pooling built in), redis-py (Python), ioredis (Node.js), go-redis (Go).
Eviction Policies
When ElastiCache hits its memory limit (maxmemory), the eviction policy decides what happens next.
| Policy | Eviction Target | Algorithm | Best For |
|---|---|---|---|
| noeviction | None; rejects writes when full | N/A | Session stores, rate limiters: data must not be silently dropped |
| allkeys-lru | Any key | Least Recently Used | General-purpose caching: the most common and safest default |
| volatile-lru | Keys with TTL set | Least Recently Used | Mixed workloads: persistent keys coexist with expiring cache entries |
| allkeys-lfu | Any key | Least Frequently Used | Workloads with clear hot/cold patterns: preserves popular keys |
| volatile-lfu | Keys with TTL set | Least Frequently Used | Mixed workloads with frequency-based eviction preference |
| volatile-ttl | Keys with shortest remaining TTL | TTL-based | When TTL should drive eviction priority |
| allkeys-random | Any key | Random | Uniform access patterns (rare in practice) |
| volatile-random | Keys with TTL set | Random | Rarely appropriate in production |
When to Use Each
For pure caching (cache-aside pattern): allkeys-lru or allkeys-lfu. LFU is generally the better pick because it keeps frequently accessed keys even if their last access was a few seconds ago, but it requires Valkey/Redis 4.0+. LFU with the default frequency counter decay handles most workloads well.
For mixed use (cache + persistent data on the same cluster): volatile-lru or volatile-lfu. Set TTL on all cache keys. Don't set TTL on persistent data. The engine only evicts keys with a TTL, so your persistent data stays safe.
For session stores: noeviction. Silently evicting sessions means users get logged out with no explanation. Terrible user experience. Monitor DatabaseMemoryUsagePercentage aggressively and scale before memory fills.
For rate limiting: noeviction. If rate limit counters get evicted, rate limiting stops working. Your backend is now exposed to traffic storms.
For time-series data with natural expiration: volatile-ttl. Data with shorter remaining TTL (the oldest data) gets evicted first, preserving the most recent records.
Observability
Critical CloudWatch Metrics
ElastiCache publishes host-level and engine-level metrics to CloudWatch at 60-second intervals. Here are the ones I watch on every production deployment:
| Metric | What It Measures | Alarm Threshold | Why It Matters |
|---|---|---|---|
| EngineCPUUtilization | CPU consumed by the Valkey/Redis engine thread | > 90% | The true engine saturation indicator for single-threaded execution |
| CPUUtilization | Total host CPU (includes OS, I/O threads, background tasks) | > 90% | Includes all processes, useful for overall node health |
| DatabaseMemoryUsagePercentage | Data memory as percentage of maxmemory | > 80% | Primary memory pressure indicator; above 80% is danger zone |
| CacheHitRate | CacheHits / (CacheHits + CacheMisses) | < 80% (workload dependent) | Low hit rate means cache is ineffective; review TTLs, key design, or size |
| Evictions | Keys evicted per period due to memory pressure | > 0 (sustained) | Sustained evictions indicate memory pressure and potential data loss |
| CurrConnections | Current client connections | > 80% of maxclients | Approaching exhaustion causes connection refused errors |
| NewConnections | New connections per second | Abnormal spikes | Spikes indicate connection churn, likely from a pooling misconfiguration |
| ReplicationLag | Seconds replica is behind primary | > 5 seconds | High lag means data loss risk on failover and stale reads |
| SwapUsage | Swap space used in bytes | > 0 | Any swap means memory pressure; cache nodes should never swap |
| NetworkBytesIn/Out | Network throughput | Approaching node type limit | Network saturation causes dropped connections and latency |
| SaveInProgress | Whether a snapshot is currently being taken | N/A (informational) | Snapshots cause memory spikes and potential latency increases |
Here's the nuance that catches teams: EngineCPUUtilization and CPUUtilization tell you different things. A 4-core node might show 25% total CPUUtilization while EngineCPUUtilization sits at 90%. Engine thread is fully saturated. Three cores are idle. If you only alarm on CPUUtilization, you miss this completely. I've diagnosed production slowdowns where the on-call engineer kept saying "CPU looks fine" because they were looking at the wrong metric. Happens more than you'd think.
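The arithmetic behind that averaging effect, as a tiny illustrative helper:

```python
def host_cpu_from_engine(engine_cpu_pct, cores):
    """Worst-case host CPUUtilization when only the engine thread is busy:
    one saturated core averaged across all cores (ignores OS/background load)."""
    return engine_cpu_pct / cores

# A 4-core node with the engine thread at 90% averages out to 22.5% host CPU.
```

Add a little OS and I/O-thread overhead and you land near the 25% figure above: a node that looks three-quarters idle while commands queue behind a saturated engine thread.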
Slow Log
The Valkey/Redis slow log captures commands exceeding a configurable execution time threshold (set via slowlog-log-slower-than, default 10,000 microseconds / 10ms). Access it with SLOWLOG GET [count].
Check it regularly. It'll help you find:
- HGETALL or SMEMBERS on large collections (thousands of fields/members)
- SUNION, SINTER, SDIFF on large sets
- KEYS commands (which scan the entire keyspace; never use in production, use SCAN instead)
- Lua scripts with unexpectedly high execution time
- SORT commands on large lists
- DEL on large keys (use UNLINK instead for background deletion)
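SLOWLOG GET returns nested arrays, so it's worth scripting the triage. An illustrative parser (field order per the engine docs: entry id, unix timestamp, duration in microseconds, command array; newer engine versions append client address and name, ignored here):

```python
def flag_slow_commands(slowlog_entries, threshold_us=5000):
    """Filter SLOWLOG GET entries above a microsecond threshold.

    Each entry is [id, unix_ts, duration_us, [cmd, *args], ...]; trailing
    client fields exist on newer engine versions and are ignored here."""
    flagged = []
    for entry in slowlog_entries:
        entry_id, _ts, duration_us, command = entry[0], entry[1], entry[2], entry[3]
        if duration_us >= threshold_us:
            flagged.append((entry_id, duration_us, " ".join(command)))
    return flagged
```

Feed this the response from your client's `slowlog_get()` call on a schedule and you get a running list of the commands worth optimizing, sorted by whatever threshold matches your latency budget.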
Alarms I Always Configure
Every production ElastiCache cluster I run gets these alarms at minimum:
- DatabaseMemoryUsagePercentage > 75% (Warning). Investigate growth trend and plan scaling.
- DatabaseMemoryUsagePercentage > 85% (Critical). Scale immediately.
- EngineCPUUtilization > 80% (Warning). Review slow log, optimize expensive commands, or scale.
- EngineCPUUtilization > 90% (Critical). Engine thread is saturated. Commands are queuing.
- Evictions > 100 per 5 minutes (Warning). Memory pressure is causing data loss.
- ReplicationLag > 5 seconds (Critical). Failover would lose multiple seconds of writes.
- CurrConnections > 50,000 (Warning). Approaching the 65,000 default limit.
- SwapUsage > 0 (Critical). The node is swapping. Performance is destroyed.
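I keep alarm definitions in code rather than clicking them together. A sketch that builds the parameter dict for CloudWatch's `put_metric_alarm`; the alarm-naming convention and the 5-minute evaluation window here are my own choices, not AWS requirements:

```python
def memory_alarm_params(cluster_id, threshold_pct, severity):
    """Parameter dict for cloudwatch.put_metric_alarm(**params).

    Naming scheme and evaluation window are illustrative choices; the
    namespace, metric, and dimension names are the real CloudWatch ones."""
    return {
        "AlarmName": f"{cluster_id}-memory-{severity}",
        "Namespace": "AWS/ElastiCache",
        "MetricName": "DatabaseMemoryUsagePercentage",
        "Dimensions": [{"Name": "CacheClusterId", "Value": cluster_id}],
        "Statistic": "Average",
        "Period": 60,                       # ElastiCache publishes at 60s intervals
        "EvaluationPeriods": 5,             # sustained for 5 minutes before firing
        "Threshold": threshold_pct,
        "ComparisonOperator": "GreaterThanThreshold",
    }
```

Generating both the 75% warning and 85% critical alarms from one function keeps thresholds consistent across every cluster you run.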
Cost Analysis
On-Demand vs. Reserved Pricing
ElastiCache node-based pricing follows the standard AWS model. Reserved instances save real money for predictable workloads, and you should be using them.
| Node Type | vCPUs | Memory | Approx. On-Demand (us-east-1) | 1-yr Reserved (All Upfront) | Savings |
|---|---|---|---|---|---|
| cache.t3.medium | 2 | 3.09 GiB | ~$0.068/hr | ~$0.044/hr | ~35% |
| cache.m6g.large | 2 | 6.38 GiB | ~$0.137/hr | ~$0.087/hr | ~37% |
| cache.r6g.large | 2 | 13.07 GiB | ~$0.166/hr | ~$0.106/hr | ~36% |
| cache.r6g.xlarge | 4 | 26.32 GiB | ~$0.332/hr | ~$0.212/hr | ~36% |
| cache.r6g.4xlarge | 16 | 105.81 GiB | ~$1.326/hr | ~$0.847/hr | ~36% |
| cache.r7g.large | 2 | 13.07 GiB | ~$0.174/hr | ~$0.111/hr | ~36% |
| cache.r6gd.xlarge | 4 | 26.3 GiB + 118 GiB SSD | ~$0.480/hr | ~$0.307/hr | ~36% |
| cache.r6g.16xlarge | 64 | 419.7 GiB | ~$6.567/hr | ~$4.203/hr | ~36% |
Reserved node options:
| Term | Payment Option | Approximate Savings vs. On-Demand |
|---|---|---|
| 1-year | No Upfront | ~25% |
| 1-year | Partial Upfront | ~30% |
| 1-year | All Upfront | ~36% |
| 3-year | All Upfront | ~48% |
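The savings compound quickly at scale. A quick arithmetic check using the table's approximate rates:

```python
HOURS_PER_YEAR = 8760

def annual_savings(on_demand_hourly, reserved_hourly):
    """Yearly savings from reserving one node, at the table's approximate rates."""
    return (on_demand_hourly - reserved_hourly) * HOURS_PER_YEAR

# One cache.r6g.xlarge: ~$0.332/hr on-demand vs ~$0.212/hr 1-yr all-upfront
# works out to roughly $1,051 saved per node per year.
```

Multiply by a production replication group of, say, 3 shards with 2 replicas each (9 nodes) and a single reservation decision is worth close to $9,500 a year at these rates.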
Serverless Pricing Comparison
Serverless rates (us-east-1), both engines:
| Dimension | Valkey | Redis OSS |
|---|---|---|
| ECPUs | ~$0.0023 per million | ~$0.0034 per million |
| Data storage | ~$0.084 per GB-hour | ~$0.125 per GB-hour |
| Minimum storage | 100 MB (~$6/month idle) | 1 GB (~$90/month idle) |
Cost crossover analysis: Where exactly provisioned beats serverless depends on your access patterns. In my experience running both, the crossover falls around $800-1,200/month of serverless spend for Valkey workloads. Below that, serverless wins on operational simplicity alone. Above that, provisioned with 1-year reserved instances typically costs less.
Cost Optimization Strategies
- Use reserved instances for stable production workloads. Even 1-year no-upfront reservations save 25%.
- Right-size node types. If DatabaseMemoryUsagePercentage stays below 50%, you're over-provisioned. Shrink the nodes.
- Use Graviton-based instances (r7g, m7g). 20-28% better price-performance than Intel. There's no reason to run Intel for ElastiCache anymore.
- Use data tiering for large datasets with hot/cold access patterns. 60%+ savings versus memory-only nodes add up fast.
- Choose Valkey over Redis OSS. 20% lower node pricing, 33% lower serverless pricing, identical functionality.
- Scale replicas based on actual read throughput. Don't default to 5 replicas per shard. Each replica is a full node on your bill.
- Set TTLs on every cache key. Keys without TTLs grow indefinitely, silently inflating memory and cost.
- Use cluster mode enabled for write scaling rather than upgrading to larger (and pricier) node types.
Common Failure Modes
Memory Pressure and OOM
The failure mode I see most often. When the dataset exceeds maxmemory, the engine either evicts keys (if eviction is configured) or rejects all writes with OOM errors (if noeviction). Extreme cases? The OS OOM killer terminates the process entirely.
Prevention:
- Monitor `DatabaseMemoryUsagePercentage` and alarm at 75% (warning) and 85% (critical)
- Account for snapshot memory overhead; forks temporarily double memory usage
- Account for memory fragmentation; actual OS memory runs 10-20% higher than logical dataset size
- Enable `activedefrag yes` to reduce fragmentation over time
- Enable `lazyfree-lazy-eviction yes` to reduce latency spikes during eviction
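Those overhead factors translate into a sizing rule of thumb. This is my own illustrative formula derived from the bullets above, not an official AWS one:

```python
def min_node_memory_gib(dataset_gib, fragmentation=0.20, snapshot_fork=True):
    """Rule-of-thumb node sizing: pad the logical dataset for fragmentation,
    then leave fork headroom for snapshots. Illustrative, not an AWS formula."""
    needed = dataset_gib * (1 + fragmentation)
    if snapshot_fork:
        needed *= 2        # a forked snapshot child can temporarily double usage
    return needed
```

A 10 GiB dataset that takes snapshots wants roughly 24 GiB of node memory by this math, which is why teams sizing nodes to their dataset alone get OOM-killed during the nightly backup.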
Failover Detection Delay
After a primary node fails, applications keep sending requests to the dead node for 30-60 seconds while ElastiCache detects the failure and promotes a replica. Writes fail during this window. Users notice.
Mitigation:
- Set client-side connection timeouts to 1-2 seconds
- Implement retry logic with exponential backoff and jitter
- Design your application to degrade gracefully when the cache is temporarily gone
- Use circuit breakers to prevent cascading failures
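The retry guidance above, in code: a full-jitter exponential backoff wrapper (the jitter keeps a fleet of retrying clients from synchronizing their retries into a second incident). `op` is any callable that raises on transient failure:

```python
import random
import time

def with_retries(op, attempts=4, base=0.05, cap=2.0, sleep=time.sleep):
    """Retry a cache operation with full-jitter exponential backoff.

    Sleeps a random amount between 0 and min(cap, base * 2**attempt);
    re-raises after the final attempt so failures stay visible."""
    for attempt in range(attempts):
        try:
            return op()
        except ConnectionError:
            if attempt == attempts - 1:
                raise
            sleep(random.random() * min(cap, base * (2 ** attempt)))
```

The 2-second cap matters: during a 30-60 second failover window you want clients retrying briefly and then falling back to degraded behavior, not sleeping their way through the user's patience.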
DNS Caching During Failover
This one catches more teams off guard than any other failure mode. After failover, ElastiCache updates the primary endpoint's DNS record (5-second TTL) to point to the promoted replica. If your application caches DNS beyond that TTL, it keeps talking to the old node. Your app thinks the cache is down. In reality it's just talking to the wrong server.
Mitigation:
- JVM: Set `networkaddress.cache.ttl=5` (the default is infinite, which is astonishing)
- Linux: Configure nscd with low TTL or disable it
- Cluster mode enabled: Use cluster-aware clients that respond to MOVED redirections
- Test failover regularly with the `TestFailover` API. Test it now, before you need it.
Large Keys Blocking Operations
Valkey/Redis uses single-threaded command execution. One command operating on a massive data structure blocks everything else until it's done. HGETALL on a 10 MB hash. DEL on a 100 MB sorted set. KEYS on a million-key keyspace. Any of these blocks the engine thread for tens or hundreds of milliseconds, causing timeouts across your entire application fleet.
Mitigation:
- Use `UNLINK` instead of `DEL` for large keys. UNLINK reclaims memory in a background thread.
- Use `HSCAN`, `SSCAN`, `ZSCAN` instead of `HGETALL`, `SMEMBERS`, `ZRANGEBYLEX` on large collections
- Never use `KEYS` in production. Use `SCAN` with the COUNT parameter.
- Enable lazy free parameters: `lazyfree-lazy-eviction`, `lazyfree-lazy-expire`, `lazyfree-lazy-server-del`
- Monitor the slow log for commands exceeding 5 ms
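The SCAN calling pattern, modeled in plain Python. Real SCAN cursors are opaque server-side positions, not list offsets; this models only the bounded-batch loop your client code should follow (redis-py's `scan_iter` wraps the same loop for you):

```python
def scan_keys(keys, cursor=0, count=100):
    """SCAN-shaped step: return (next_cursor, batch) so callers walk the
    keyspace in bounded chunks instead of one blocking KEYS call.
    Simplified model: real cursors are not simple offsets."""
    batch = keys[cursor:cursor + count]
    next_cursor = cursor + count if cursor + count < len(keys) else 0
    return next_cursor, batch

def all_keys_incrementally(keys, count=100):
    """Drive the cursor loop to completion; cursor 0 signals the end."""
    cursor, found = 0, []
    while True:
        cursor, batch = scan_keys(keys, cursor, count)
        found.extend(batch)
        if cursor == 0:
            return found
```

Each step holds the engine thread for one bounded batch, so other clients' commands interleave between batches instead of stalling behind a full keyspace scan.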
Lua Script Timeouts
Lua scripts execute atomically. No other commands run until the script completes, so a runaway script blocks the entire shard. The lua-time-limit parameter (default 5,000ms / 5 seconds) controls when the engine starts accepting SCRIPT KILL commands. It doesn't automatically kill the script, though. You have to do that yourself.
Mitigation:
- Keep Lua scripts short and bounded. No unbounded loops over large datasets.
- Test scripts against production-scale data volumes in staging
- Monitor `EngineCPUUtilization` for sustained 100% spikes that indicate a stuck script
- Use `EVALSHA` (cached compiled script) instead of `EVAL` (inline script) for frequently executed scripts
- Set a reasonable `lua-time-limit` and have runbooks for `SCRIPT KILL`
Cluster Mode Resharding Impact
During online resharding, the cluster migrates hash slots between shards. Keys in migrating slots experience higher latency, and multi-key operations spanning migrating and non-migrating slots fail with CROSSSLOT errors.
Mitigation:
- Schedule resharding during low-traffic windows
- Monitor latency metrics closely during the operation
- Use hash tags to co-locate related keys, so they migrate together atomically
- Avoid resharding during periods of high write throughput
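Hash tags work because cluster mode assigns every key to one of 16,384 slots via CRC16, hashing only the brace-delimited tag when one is present. A sketch of the slot computation:

```python
def crc16(data: bytes) -> int:
    """CRC16-CCITT (XModem), the checksum cluster mode uses for key slots."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021) & 0xFFFF if crc & 0x8000 else (crc << 1) & 0xFFFF
    return crc

def key_slot(key: str) -> int:
    """Hash slot for a key, honoring {hash tags}: only the first non-empty
    brace-delimited section is hashed, so tagged keys land on one shard."""
    start = key.find("{")
    if start != -1:
        end = key.find("}", start + 1)
        if end != -1 and end > start + 1:
            key = key[start + 1:end]
    return crc16(key.encode()) % 16384
```

Because `user:{42}:profile` and `user:{42}:cart` both hash only `42`, they share a slot, migrate together during resharding, and remain valid targets for multi-key operations.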
Caching Patterns
Your caching pattern determines how data flows between application, cache, and database. Each one brings different consistency, latency, and complexity trade-offs.
Cache-Aside (Lazy Loading)
Most common pattern and generally the safest. Application checks the cache first. On a miss, it reads from the database, writes the result to cache with a TTL, and returns it. I use this as my default for every new project until I've got a specific reason to do otherwise.
| Advantage | Disadvantage |
|---|---|
| Only requested data is cached (efficient memory use) | First request for each key incurs cache-miss latency (3 round trips) |
| Cache failure is non-fatal (application degrades to database reads) | Data can become stale until TTL expires or explicit invalidation |
| Simple to implement and reason about | Application code must handle miss/fill logic |
```mermaid
flowchart TD
    A[Application<br/>needs data] --> B{Check<br/>Cache}
    B -->|Hit| C[Return<br/>cached data]
    B -->|Miss| D[Read from<br/>Database]
    D --> E[Write to Cache<br/>with TTL]
    E --> F[Return data<br/>to caller]
    style C fill:#2d7,stroke:#333
    style D fill:#e94,stroke:#333
```
Write-Through
Every database write is immediately followed by a cache write. Once populated, reads always hit the cache.
| Advantage | Disadvantage |
|---|---|
| Cache is always consistent with the database after writes | Write latency increases (two writes per operation) |
| Eliminates stale data for recently written keys | Caches data that may never be read (wasted memory) |
| Simplifies the read path | Cache availability affects write path |
Write-Behind (Write-Back)
Application writes to the cache. The cache asynchronously writes to the database. The cache becomes your primary write target.
| Advantage | Disadvantage |
|---|---|
| Lowest write latency (cache write only) | Data loss risk if cache fails before async write |
| Can batch and coalesce database writes | Complex to implement reliably |
| Absorbs write spikes, protects the database | Debugging data inconsistencies is difficult |
Write-behind carries real risk. I only use it when the performance benefit justifies the complexity and occasional data loss is acceptable. Analytics counters and activity aggregations? Good candidates. Financial records? Absolutely not.
Read-Through
The cache itself loads data from the database on a miss. Application only talks to the cache. ElastiCache doesn't natively support read-through, so you'll need a caching library or application-side middleware to implement it.
TTL Strategies
| Strategy | TTL Range | Use Case |
|---|---|---|
| Short TTL | 5-60 seconds | Near-real-time data: stock prices, live scores, rapidly changing content |
| Medium TTL | 5-30 minutes | Semi-dynamic data: user profiles, product listings, search results |
| Long TTL | 1-24 hours | Slowly changing data: configuration, reference tables, CMS content |
| No TTL + explicit invalidation | Indefinite | Event-driven updates via pub/sub or application logic |
| Jittered TTL | Base +/- 10-20% random | Any pattern: prevents synchronized expiration across related keys |
Set a TTL on every cache key. Even with explicit invalidation, a TTL acts as a safety net for when invalidation fails or gets delayed. I default to 1 hour for most workloads and adjust per key type based on how much staleness the application can tolerate.
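Jitter is a one-liner worth standardizing across your codebase. The 15% default here is my own habit, not a canonical value:

```python
import random

def jittered_ttl(base_seconds, jitter=0.15, rng=random.uniform):
    """Base TTL +/- 15% random jitter so related keys don't expire in lockstep."""
    return round(base_seconds * (1 + rng(-jitter, jitter)))
```

Route every `SET ... EX` through this and a batch of keys written together expires spread across a window instead of at one synchronized instant.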
Cache Stampede Prevention
A cache stampede (thundering herd) happens when a popular key expires and hundreds of concurrent requests simultaneously miss the cache and slam the database. I've been on the receiving end of these. They cause real outages.
Prevention strategies:
| Strategy | Mechanism | Trade-off |
|---|---|---|
| Jittered TTLs | Add random offset (base TTL +/- 10-20%) | Simple but does not prevent stampede on individual hot keys |
| Lock-based recomputation | Acquire distributed lock (SET NX EX), recompute, release. Others wait or return stale value. | Prevents stampede but adds lock management complexity |
| Early recomputation | Background process refreshes cache before TTL expires | Key never actually expires under normal operation, but requires background infrastructure |
| Stale-while-revalidate | Return stale value immediately, refresh asynchronously | Best user experience but requires storing metadata alongside values |
For high-traffic keys, I combine jittered TTLs with lock-based recomputation. Lock stops the stampede. Jitter prevents synchronized expiration across related keys. Together they handle the vast majority of scenarios.
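A sketch of the lock half of that combination. On a real cluster the lock is a `SET NX EX` call against the cache itself; here a dict simulates the lock keyspace, and concurrent misses serve the stale value instead of piling onto the database:

```python
import time

def get_with_lock(cache, locks, key, recompute, ttl=300, lock_ttl=10,
                  now=time.monotonic):
    """Lock-based stampede protection sketch. The `locks` dict simulates
    SET NX EX: first miss acquires the lock and recomputes; concurrent
    misses return the stale value rather than hitting the database."""
    entry = cache.get(key)
    if entry is not None and now() < entry[1]:
        return entry[0]                         # fresh hit
    lock_key = f"lock:{key}"
    lock_expiry = locks.get(lock_key)
    if lock_expiry is None or now() >= lock_expiry:
        locks[lock_key] = now() + lock_ttl      # acquired (SET NX EX analogue)
        value = recompute(key)
        cache[key] = (value, now() + ttl)
        locks.pop(lock_key, None)               # release
        return value
    if entry is not None:
        return entry[0]                         # stale-but-available beats a stampede
    raise LookupError("cold key while another worker recomputes")
```

The `lock_ttl` is the safety valve: if the recomputing worker dies, the lock expires and another worker takes over instead of the key staying wedged.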
Key Architectural Patterns
After years running ElastiCache in production, these are the patterns and decisions that matter most:
- Choose Valkey for new deployments. API-compatible with Redis. Fully open source under the Linux Foundation. 20-33% cheaper. AWS is investing its development energy here.
- Start with cluster mode enabled. Migrating from disabled to enabled means creating a new cluster. Start with enabled (even single-shard) and you get horizontal scalability without a migration event later.
- Enable Multi-AZ with automatic failover for every production cluster. Cost: one replica per shard. Cost of skipping it: application degradation, database overload, customer impact during a cache outage. Not a close call.
- Set a TTL on every key. Cache keys without TTLs accumulate silently until they trigger evictions or OOM. A TTL bounds staleness and manages memory simultaneously.
- Monitor EngineCPUUtilization, not just CPUUtilization. For single-threaded Valkey/Redis, EngineCPUUtilization shows actual engine saturation. Overall CPUUtilization on a multi-core node hides a fully saturated engine thread behind averaged-out idle cores.
- Fix DNS caching before your first failover. DNS caching during failover is the most common cause of extended outages after an ElastiCache node failure. Test failover with the TestFailover API in staging. Verify your application follows DNS changes within seconds.
- Never use the KEYS command in production. Use SCAN with a COUNT parameter. KEYS blocks the engine thread for its entire execution, scanning every key. On busy clusters, this cascades into timeouts across every connected application.
- Implement connection pooling from day one. Connection churn wastes resources on both sides, adds latency, and exhausts the node's connection limit during traffic spikes.
- Use data tiering for large, infrequently-accessed datasets. The 60%+ cost savings on r6gd instances are real. ~300-microsecond SSD latency is acceptable for workloads with genuine hot/cold access patterns.
- Design for cache failure. Your application must degrade gracefully when the cache is unavailable. If a cache outage becomes an application outage, you've got a design problem.
Additional Resources
- Amazon ElastiCache Developer Guide: comprehensive reference for all ElastiCache features, configuration options, and API operations
- ElastiCache for Valkey Getting Started Guide: walkthrough for creating and connecting to a Valkey cache in both serverless and node-based modes
- Database Caching Strategies Using Redis: AWS whitepaper covering cache-aside, write-through, and other caching patterns with detailed implementation guidance
- Best Practices for Sizing Your ElastiCache Clusters: AWS-published guidance on node type selection, memory sizing, and connection management
- ElastiCache CloudWatch Metrics Reference: complete list of available metrics with descriptions and recommended alarm thresholds
- ElastiCache Data Tiering Documentation: architecture details, supported node types, and performance characteristics for r6gd data tiering
- ElastiCache Security Best Practices: encryption, authentication, RBAC, IAM integration, and network isolation guidance
- Monitoring Best Practices with ElastiCache Using CloudWatch: detailed guidance on metric selection, alarm configuration, and observability patterns
- Valkey Project Documentation: upstream engine documentation, command reference, and release notes at valkey.io
- Amazon ElastiCache Pricing: current pricing for on-demand, reserved, and serverless across all regions and node types
Let's Build Something!
I help teams ship cloud infrastructure that actually works at scale. Whether you're modernizing a legacy platform, designing a multi-region architecture from scratch, or figuring out how AI fits into your engineering workflow, I've seen your problem before. Let me help.
Currently taking on select consulting engagements through Vantalect.

