About the author: I'm Charles Sieg, a cloud architect and platform engineer who builds apps, services, and infrastructure for Fortune 1000 clients through Vantalect. If your organization is rethinking its software strategy in the age of AI-assisted engineering, let's talk.
ElastiCache looks easy. Deploy a managed cache, point your app at the endpoint, enjoy sub-millisecond reads. Then production happens. Engine selection, cluster topology, eviction policy, replication strategy, connection management, failover behavior: every one of these choices determines whether your caching layer holds up or collapses at 3 AM on a Saturday. I've spent years building and running ElastiCache clusters serving millions of requests per second. Some fronted relational databases with multi-terabyte datasets. Others were dead-simple session stores. All of them taught me something. Usually through failure first.
This is an architecture reference. If you're here to understand ElastiCache internals, scaling strategies, failure modes that blindside teams, and how to make sharp decisions about engines, topologies, and the newer serverless offering, you're in the right place.
What ElastiCache Actually Is
ElastiCache is AWS's fully managed in-memory caching service. Provisioning, patching, monitoring, failure detection, automatic failover, backup: AWS handles all of it. You focus on cache key design, eviction strategy, and wiring it into your application. Three engines: Valkey, Redis OSS, and Memcached.
It's changed a lot since 2011. Managed Memcached came first. Redis support landed in 2013, cluster mode enabled in 2016, encryption and RBAC between 2018 and 2020, data tiering on r6gd instances in 2021, ElastiCache Serverless in 2023, Valkey support in 2024. Every addition shifted how you should deploy and operate caching layers on AWS.
The Shift to Valkey
In March 2024, Redis Ltd. changed the Redis license from the permissive BSD 3-Clause to a dual SSPL/RSAL model. Cloud providers couldn't offer Redis as a managed service without a commercial agreement anymore. AWS, Google, Oracle, Ericsson, and the Linux Foundation responded by forking Redis 7.2.4 into Valkey (open-source, BSD license, under the Linux Foundation).
If you were around for the Elasticsearch/OpenSearch fork in 2021, same playbook. For ElastiCache users, the practical result is a transition. AWS now recommends Valkey for all new deployments. Pricing is 33% lower for serverless and 20% lower for node-based configurations versus Redis OSS. Valkey is wire-protocol compatible with Redis, so your existing client libraries, commands, and data structures work without modification. The Valkey project has already shipped its own innovations: I/O multi-threading in Valkey 8.0, performance improvements in 8.1, bloom filters, and vector search.
New project? Valkey. Already running Redis OSS clusters? Plan a migration but don't rush it. The transition is straightforward, and ElastiCache still supports Redis OSS.
Engine Comparison
| Feature | Valkey | Redis OSS | Memcached |
|---|---|---|---|
| License | BSD (Linux Foundation) | SSPL/RSAL (Redis Ltd.) | BSD |
| Data structures | Strings, lists, sets, sorted sets, hashes, streams, HyperLogLog, bitmaps, geospatial | Same as Valkey | Strings only |
| Max value size | 512 MB | 512 MB | 1 MB (default, configurable) |
| Persistence | RDB snapshots, AOF | RDB snapshots, AOF | None |
| Replication | Primary/replica with automatic failover | Primary/replica with automatic failover | None |
| Clustering | Hash slot sharding (16,384 slots) | Hash slot sharding (16,384 slots) | Client-side consistent hashing |
| Pub/Sub | Yes | Yes | No |
| Lua scripting | Yes | Yes | No |
| Transactions | MULTI/EXEC | MULTI/EXEC | CAS (check-and-set) only |
| Multi-threading | I/O multi-threading (Valkey 8.0+) | Single-threaded command execution | Fully multi-threaded |
| Serverless | Yes | Yes | No |
| Data tiering | Yes (r6gd nodes) | Yes (r6gd nodes) | No |
| ElastiCache pricing | 20-33% lower than Redis OSS | Baseline | Comparable to Redis OSS |
Valkey/Redis vs. Memcached is a quick decision. Need data structures, persistence, replication, pub/sub, scripting, or serverless? Valkey. Memcached still has one edge: its native multi-threaded architecture saturates all available CPU cores without cluster mode. That said, I pick Valkey for new deployments unless I've got a specific, benchmarked reason to do otherwise.
Redis (Valkey) vs. Memcached: A Deep Comparison
These engines differ at a fundamental level. You need to understand those differences to pick the right one and to know what you're signing up for operationally.
Threading and Execution Model
Valkey/Redis runs a single-threaded event loop for command execution. Every command (GET, SET, ZADD, Lua script) runs sequentially on one CPU core. No lock contention. Deterministic latency regardless of concurrency. Recent versions moved I/O operations (network read/write, disk persistence, memory deallocation) to background threads, and Valkey 8.0 pushed this further with I/O multi-threading. Network I/O now parallelizes while command execution stays single-threaded. Throughput gains are significant.
Memcached goes the other way entirely. A pool of worker threads handles connections and processes commands concurrently. Sixteen-core node? All 16 cores execute commands simultaneously. For simple GET/SET workloads on big instances, Memcached wins on raw throughput. The catch: that assumes you're CPU-bound rather than memory-bound or network-bound. In practice, that assumption holds less often than people think.
Data Structures and Complexity
Valkey/Redis has seven primary data structures with purpose-built commands and O(1) or O(log N) operations for common access patterns. Sorted sets give you leaderboards and range queries. Streams give you durable pub/sub with consumer groups. HyperLogLog does cardinality estimation in 12 KB. Bitmaps let you store per-user feature flags in megabytes instead of gigabytes.
Memcached supports strings. That's it. Serialize your data, store it, deserialize on read. No atomic operations beyond CAS (compare-and-swap), no range queries, no aggregations, no pub/sub.
Persistence and Durability
Valkey/Redis offers two persistence mechanisms: RDB snapshots (point-in-time binary dumps) and AOF (append-only file logging). Together they let you recover from node restarts with minimal data loss. ElastiCache automates snapshot management with configurable retention up to 35 days.
Memcached has no persistence. None. Node restart means total data loss. Node failure permanently destroys every key on that node, and there's no recovery mechanism.
Replication and High Availability
Valkey/Redis supports primary/replica replication with automatic failover. Each shard can hold up to five read replicas distributed across Availability Zones. Primary fails? ElastiCache promotes a replica in 30-60 seconds.
Memcached has no replication. Each key lives on exactly one node. That node fails, every key on it vanishes instantly, and the resulting cache miss storm hammers your database.
Lua Scripting
Valkey/Redis supports server-side Lua scripting that executes atomically. Read a value, compute something, write a result: one atomic operation, zero race conditions. I've used this for rate limiting with sliding windows, conditional updates, and multi-step transactions that would otherwise need distributed locks.
Memcached? No scripting capability at all.
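To show what that gap buys you, here's the sliding-window rate-limit logic I'd normally push into a server-side Lua script, sketched in plain Python against a dict standing in for a sorted set. The class and numbers are illustrative; in production the same three steps would run as one atomic EVAL so no race conditions are possible.

```python
import time

class SlidingWindowLimiter:
    """Sketch of logic that, in production, would ship as an atomic Lua script:
    drop entries older than the window, count what's left, record the new hit."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.events = {}  # key -> list of timestamps (stand-in for a Redis sorted set)

    def allow(self, key, now=None):
        now = time.time() if now is None else now
        window_start = now - self.window
        # Drop timestamps outside the window (ZREMRANGEBYSCORE equivalent)
        kept = [t for t in self.events.get(key, []) if t > window_start]
        if len(kept) >= self.limit:   # ZCARD equivalent: window is full
            self.events[key] = kept
            return False
        kept.append(now)              # ZADD equivalent: record this request
        self.events[key] = kept
        return True

limiter = SlidingWindowLimiter(limit=3, window_seconds=10)
results = [limiter.allow("client-1", now=100 + i) for i in range(4)]
# results == [True, True, True, False]: the fourth request exceeds the limit
```

Run as Lua on the server, the read-count-write sequence executes atomically even under concurrent clients, which is exactly what Memcached can't give you.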
Architecture Internals
You need to understand how ElastiCache clusters are built before anything else. Topology, sizing, and resilience decisions all flow from this.
Cluster Components
| Component | Purpose | Key Considerations |
|---|---|---|
| Nodes | Individual cache instances running on EC2 | Choose node type based on memory, CPU, and network requirements |
| Shards (node groups) | A primary node + 0-5 replicas | Each shard holds a subset of the keyspace (cluster mode enabled) or the full keyspace (cluster mode disabled) |
| Replication group | Collection of shards forming a logical cluster | The unit of failover, backup, and scaling operations |
| Parameter group | Engine configuration settings | Controls eviction policy, memory limits, timeouts, slow log thresholds, and feature flags |
| Subnet group | VPC subnet placement | Determines AZ distribution and network isolation for cache nodes |
| Security group | Network access control | Controls which resources can connect to the cache on port 6379 (Valkey/Redis) or 11211 (Memcached) |
Node Types
ElastiCache node types follow the EC2 naming convention with a cache. prefix. Getting the family right is one of the highest-leverage decisions you'll make on cost and performance.
| Family | Processor | Optimized For | Production Use Case |
|---|---|---|---|
| cache.t3 / cache.t4g | Intel / Graviton2 | Burstable CPU | Development, testing, small workloads with variable traffic |
| cache.m6g / cache.m7g | Graviton2 / Graviton3 | Balanced compute and memory | General-purpose caching where CPU and memory needs are balanced |
| cache.r6g / cache.r7g | Graviton2 / Graviton3 | Memory-optimized | Large datasets, high memory-to-CPU ratio workloads |
| cache.r6gd | Graviton2 + NVMe SSD | Data tiering (memory + SSD) | Very large datasets with hot/cold access patterns |
| cache.c7gn | Graviton3 + enhanced networking | Network-optimized | Extremely high throughput requirements |
R7g nodes deliver up to 28% more throughput and 21% better P99 latency over R6g, plus 25% higher networking bandwidth. Graviton-based instances consistently run 20-30% better price-performance than Intel equivalents.
Sizing advice: start with cache.r6g.large or cache.r7g.large for most production workloads. The r-family gives you the most memory per dollar, and memory is almost always the binding constraint. Go with m-family only when your workload is genuinely CPU-bound (heavy Lua scripting, complex sorted set operations). T-family? Dev and test only. Burstable instances will surprise you with latency spikes once CPU credits deplete. I learned that one the hard way during a load test that looked great for two hours, then fell off a cliff.
Parameter Groups
Parameter groups control engine-level behavior. Defaults work fine for development. Production needs tuning.
| Parameter | Default | Production Recommendation | Rationale |
|---|---|---|---|
| maxmemory-policy | volatile-lru | Depends on use case (see Eviction Policies) | Controls behavior when memory is exhausted |
| timeout | 0 (disabled) | 300-600 seconds | Prevents connection leaks from crashed clients |
| tcp-keepalive | 300 | 60-120 seconds | Detects dead connections faster |
| activedefrag | no | yes | Reduces memory fragmentation on long-running clusters |
| lazyfree-lazy-eviction | no | yes | Moves eviction memory reclamation to background thread |
| lazyfree-lazy-expire | no | yes | Moves TTL expiration memory reclamation to background thread |
| lazyfree-lazy-server-del | no | yes | Moves server-initiated DEL to background thread |
| slowlog-log-slower-than | 10000 (10ms) | 5000 (5ms) | Catch slow operations earlier |
| notify-keyspace-events | "" (disabled) | Enable only if needed | Keyspace notifications add CPU and memory overhead |
Subnet Groups and Network Placement
ElastiCache clusters run inside your VPC. No public endpoint, period. A subnet group defines which subnets (and therefore which AZs) receive cache nodes.
For production clusters with Multi-AZ enabled, include subnets in at least two AZs. I prefer three. ElastiCache distributes primary and replica nodes across available AZs automatically.
Your applications must be in the same VPC, a peered VPC, or connected via Transit Gateway to reach ElastiCache. No internet-facing endpoint. No NAT Gateway path. No VPC endpoint like DynamoDB or S3 have. Plan your network topology accordingly.
Cluster Mode: Disabled vs. Enabled
One of the most consequential architectural decisions you'll make with ElastiCache for Valkey/Redis. Also one of the hardest to reverse. For years there was no in-place migration from cluster mode disabled to enabled: you created a new cluster and migrated data. Recent engine versions (Redis OSS 6.x+ and Valkey) support an online conversion, but it's a multi-step operation that also forces every client onto a cluster-aware library. I've watched teams learn this the painful way.
Cluster Mode Disabled
Cluster mode disabled means a single shard: one primary node and up to five read replicas. Every node holds the entire dataset.
| Characteristic | Detail |
|---|---|
| Shards | 1 |
| Max replicas | 5 |
| Max memory | Limited to a single node (up to ~419 GiB on cache.r7g.16xlarge) |
| Read scaling | Add read replicas (up to 5), reader endpoint load-balances across them |
| Write scaling | Vertical only: upgrade to a larger node type |
| Multi-key operations | Unrestricted: MGET, MSET, transactions, Lua scripts work across all keys |
| Endpoints | Primary endpoint (writes) + Reader endpoint (reads) |
| Client requirements | Any Redis client; no cluster awareness needed |
Simpler to operate. Makes sense when your dataset fits comfortably in a single node with growth headroom and your write throughput stays within what one primary can handle.
Cluster Mode Enabled
Cluster mode enabled partitions the keyspace across multiple shards using 16,384 hash slots. Each key maps to a slot via CRC16(key) mod 16384. Each shard owns a contiguous range of those slots.
| Characteristic | Detail |
|---|---|
| Shards | 1 to 500 |
| Max replicas per shard | 5 |
| Max nodes per cluster | 500 (e.g., 83 shards x 6 nodes, or 500 shards x 1 primary each) |
| Max memory | Sum of all shard memory (petabyte-scale with data tiering) |
| Read scaling | Add replicas within each shard |
| Write scaling | Add more shards (horizontal scaling via online resharding) |
| Multi-key operations | Only for keys in the same hash slot; use hash tags {tag} to co-locate |
| Endpoints | Configuration endpoint (cluster-aware clients required) |
| Client requirements | Cluster-aware client (handles MOVED/ASK redirections) |
Hash Slot Distribution and Hash Tags
ElastiCache distributes the 16,384 hash slots across shards. Three shards? Each shard typically owns roughly 5,461 slots. Rebalancing tries to distribute slots evenly, but keys aren't uniform in size. Uniform slot distribution doesn't guarantee uniform memory distribution.
Hash tags are critical for multi-key operations in cluster mode. If you need MGET, transactions, or Lua scripts across related keys, those keys must hash to the same slot. Force this by including a common substring in curly braces:
- user:{12345}:profile and user:{12345}:preferences both hash based on 12345
- They will always land on the same shard, enabling atomic multi-key operations
A mistake I see over and over: teams migrate to cluster mode enabled without planning hash tags. Applications that use MGET across unrelated keys, Lua scripts touching keys on different shards, transactions spanning arbitrary keys. They all break with CROSSSLOT errors. Silent, sudden, and completely avoidable with planning.
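The slot math itself is simple enough to sketch. Here's a minimal Python version of the keyslot algorithm cluster-aware clients implement: CRC16 (XMODEM variant) of the key mod 16384, hashing only the hash-tag substring when a non-empty one is present. Function names are mine.

```python
def crc16_xmodem(data: bytes) -> int:
    """CRC16/XMODEM (poly 0x1021, init 0), the checksum used for key slots."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = (crc << 1) ^ 0x1021 if crc & 0x8000 else crc << 1
            crc &= 0xFFFF
    return crc

def key_hash_slot(key: str) -> int:
    """Map a key to one of the 16,384 hash slots, honoring {hash tags}."""
    raw = key.encode()
    start = raw.find(b"{")
    if start != -1:
        end = raw.find(b"}", start + 1)
        if end > start + 1:            # non-empty tag: hash only the tag
            raw = raw[start + 1:end]
    return crc16_xmodem(raw) % 16384

# Keys sharing the tag {12345} land in the same slot, so MGET works across them
same = key_hash_slot("user:{12345}:profile") == key_hash_slot("user:{12345}:preferences")
```

If two keys don't share a slot under this function, any multi-key command touching both will fail with CROSSSLOT in cluster mode.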
Resharding Operations
Cluster mode enabled lets you add or remove shards (online resharding) and rebalance hash slots while the cluster serves traffic. Resharding has real operational costs, though.
| Aspect | Impact |
|---|---|
| Availability | Cluster remains available but migrating keys may experience elevated latency |
| Duration | Proportional to data volume being migrated; large shards take longer |
| Configuration changes | Cannot process other configuration changes during resharding |
| Multi-key operations | Commands spanning migrating and non-migrating slots may fail during migration |
| Memory overhead | Source and destination shards temporarily hold copies of migrating data |
Here's what I tell every team: if there's any chance your dataset will outgrow a single node, start with cluster mode enabled from day one. Even a single-shard cluster mode enabled configuration preserves the option to add shards later. Save yourself the painful migration event down the road.
```mermaid
flowchart TB
    subgraph CMD["Cluster Mode Disabled"]
        direction TB
        P1["Primary Node<br/>Full Dataset"] --> R1[Replica 1]
        P1 --> R2[Replica 2]
        CE1[Primary Endpoint] --> P1
        CE2[Reader Endpoint] --> R1
        CE2 --> R2
    end
    subgraph CME["Cluster Mode Enabled"]
        direction TB
        subgraph S1["Shard 1<br/>Slots 0-5460"]
            P3[Primary] --> R3[Replica]
        end
        subgraph S2["Shard 2<br/>Slots 5461-10922"]
            P4[Primary] --> R4[Replica]
        end
        subgraph S3["Shard 3<br/>Slots 10923-16383"]
            P5[Primary] --> R5[Replica]
        end
        CE3["Configuration<br/>Endpoint"] --> S1
        CE3 --> S2
        CE3 --> S3
    end
```

Replication and High Availability
Primary/Replica Architecture
Each shard in a Valkey/Redis replication group has one primary node and zero to five replicas. The primary handles all writes and propagates changes to replicas asynchronously. Replicas serve read traffic and stand ready for promotion if the primary fails.
Replication is async by default. Under normal conditions, lag is sub-millisecond. But there's always a window where a replica hasn't received the latest writes yet. After a failover, a small number of recent writes are gone. If your application can't tolerate any data loss for specific data, that data belongs in a different persistence layer. Full stop.
Multi-AZ with Automatic Failover
Enable Multi-AZ for every production cluster. No qualifications on this one. ElastiCache distributes primary and replica nodes across Availability Zones and automatically promotes a replica when the primary goes down.
The failover sequence looks like this:
- ElastiCache detects the primary node is unhealthy (typically within 30-60 seconds)
- It selects the replica with the least replication lag
- The selected replica is promoted to primary
- DNS is updated to point the primary endpoint to the new primary
- Other replicas begin replicating from the new primary
- A replacement replica is provisioned in the original AZ
Total failover time runs 30-60 seconds in most cases. It can stretch longer under heavy load or with large datasets on the promoted replica. During failover, writes to the affected shard fail. Reads from replicas keep working.
```mermaid
sequenceDiagram
    participant App as Application
    participant EP as Primary Endpoint
    participant P as Primary Node AZ-1
    participant R as Replica Node AZ-2
    Note over P: Primary fails
    App->>EP: Write request
    EP-->>App: Connection error
    rect rgb(255,230,230)
        Note over EP,R: Failover (30-60 seconds)
        R->>R: Promoted to Primary
        EP->>EP: DNS updated to AZ-2
    end
    App->>EP: Retry write
    EP->>R: Route to new Primary
    R-->>App: Success
    Note over P: Replacement replica<br/>provisioned in AZ-1
```
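Since writes to the affected shard fail until promotion completes, clients need retry logic around writes. Here's a hedged sketch of exponential backoff with jitter; flaky_write is a stand-in that simulates a failover rather than calling a real cluster.

```python
import random
import time

def retry_with_backoff(operation, max_attempts=5, base_delay=0.1, max_delay=5.0,
                       retryable=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Retry a callable with exponential backoff plus jitter.

    During a 30-60 second failover window, early attempts fail and later ones
    succeed once DNS points at the promoted replica.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except retryable:
            if attempt == max_attempts - 1:
                raise                                  # out of attempts: surface the error
            delay = min(max_delay, base_delay * (2 ** attempt))
            sleep(delay * random.uniform(0.5, 1.0))    # jitter avoids thundering herd

# Simulate a failover: the first two attempts hit the dead primary
attempts = {"n": 0}
def flaky_write():
    attempts["n"] += 1
    if attempts["n"] <= 2:
        raise ConnectionError("primary unreachable")
    return "OK"

result = retry_with_backoff(flaky_write, sleep=lambda _: None)  # no real sleeping in the demo
# result == "OK" on the third attempt
```

Most mature Redis clients ship equivalent retry policies; the point is to configure them deliberately with the 30-60 second failover window in mind.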
Replica Lag Monitoring
Replica lag is the single most important replication health metric. It measures the delay between a write hitting the primary and that write getting applied on a replica. Watch the ReplicationLag CloudWatch metric for every replica node. Don't skip this.
| Lag Range | Interpretation | Action |
|---|---|---|
| < 1 second | Normal operation | No action needed |
| 1-5 seconds | Elevated; possible network congestion or heavy write load | Investigate primary write throughput and network |
| 5-10 seconds | Warning; failover would lose multiple seconds of writes | Scale up node type or reduce write volume |
| > 10 seconds | Critical; significant data loss risk on failover | Immediate investigation required |
Persistent high replica lag means the primary's write throughput exceeds what the replica can keep up with. Common culprits: sustained write bursts, large key operations, Lua scripts generating tons of writes, slow cross-AZ network.
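Those thresholds translate directly into an alerting rule. A trivial sketch; the function name and cutoffs mirror the table above, and you should tune them for your own workload.

```python
def classify_replication_lag(lag_seconds):
    """Map a ReplicationLag reading (seconds) to a severity per the table above."""
    if lag_seconds < 1:
        return "ok"          # normal operation
    if lag_seconds < 5:
        return "elevated"    # investigate write throughput and network
    if lag_seconds < 10:
        return "warning"     # failover would lose multiple seconds of writes
    return "critical"        # significant data loss risk on failover

severities = [classify_replication_lag(s) for s in (0.2, 3, 7, 12)]
# severities == ["ok", "elevated", "warning", "critical"]
```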
Read Replicas for Read Scaling
Read replicas do two things. They're failover targets for high availability, and they scale read throughput. The reader endpoint load-balances read connections across all replicas in a shard.
For read-heavy workloads (80%+ reads), adding replicas cuts EngineCPUUtilization on the primary by offloading read traffic. Each replica handles reads independently on its own engine thread, so read throughput scales linearly with replica count.
Replicas cost money, though. Each one is a full node on your bill. If your workload is write-heavy and replicas sit underutilized for reads, you're paying for capacity that only provides HA value. One replica per shard is enough for failover. Add more only when measured read throughput actually demands it.
ElastiCache Serverless
ElastiCache Serverless launched in late 2023 with a fundamentally different operational model. No provisioning nodes, no selecting cluster topologies, no managing scaling policies. You create a serverless cache. ElastiCache handles capacity management.
Architecture
When you create a serverless cache, ElastiCache provisions multi-AZ, replicated infrastructure behind the scenes. You get a single endpoint. Compute and storage scale independently based on workload. No manual intervention.
You still specify VPC and subnet configurations for network isolation. Everything else (node management, AZ distribution, failover, patching, capacity planning) happens transparently.
| Aspect | Serverless | Node-Based (Provisioned) |
|---|---|---|
| Capacity planning | Automatic | Manual: you choose node types and count |
| Scaling | Automatic, scales in minutes | Manual: add/remove nodes or shards |
| High availability | Built-in Multi-AZ, automatic | Must enable Multi-AZ and configure replicas |
| Patching | Automatic, zero downtime | Requires maintenance windows |
| Pricing model | Pay per ECPU + storage (GB-hours) | Pay per node-hour (on-demand or reserved) |
| Configuration | Limited tuning knobs | Full parameter group control |
| Max throughput | 5 million requests/second per cache | Depends on cluster sizing |
| Engine support | Valkey, Redis OSS | Valkey, Redis OSS, Memcached |
| Data tiering | Not available | Available on r6gd nodes |
| Setup time | Under 1 minute | 5-15 minutes |
ECPU Pricing Model
Serverless pricing has two dimensions.
ElastiCache Processing Units (ECPUs): Each kilobyte of data transferred (reads and writes) costs 1 ECPU. A GET returning a 3 KB value? 3 ECPUs. Simple commands on small values cost 1 ECPU.
Data storage: Charged per GB-hour based on the hourly average of data stored. Valkey serverless has a minimum data storage of 100 MB (90% lower than the 1 GB minimum for Redis OSS), which sets the floor for storage costs.
| Dimension | Approximate Rate (us-east-1) |
|---|---|
| ECPUs (Valkey) | $0.0034 per million ECPUs |
| Data storage (Valkey) | ~$0.084 per GB-hour (vs. ~$0.125 for Redis OSS) |
| Minimum storage (Valkey) | 100 MB |
| Minimum storage (Redis OSS) | 1 GB |
An idle Valkey serverless cache runs about $6/month.
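A back-of-envelope estimator makes the ECPU model concrete. The rates here are assumptions rounded from the table (real bills also meter vCPU time for expensive commands), so treat the output as directional, not a quote.

```python
import math

def ecpus_for_request(payload_bytes):
    """Each KB transferred costs 1 ECPU, with a 1-ECPU minimum per request."""
    return max(1, math.ceil(payload_bytes / 1024))

def monthly_cost(requests_per_month, avg_payload_bytes, avg_stored_gb,
                 ecpu_rate_per_million=0.0034,     # assumed Valkey ECPU rate, us-east-1
                 storage_rate_per_gb_hour=0.084,   # assumed Valkey storage rate
                 min_storage_gb=0.1,               # 100 MB Valkey minimum
                 hours_per_month=730):
    ecpu_cost = (requests_per_month * ecpus_for_request(avg_payload_bytes)
                 / 1e6 * ecpu_rate_per_million)
    storage_cost = max(avg_stored_gb, min_storage_gb) * storage_rate_per_gb_hour * hours_per_month
    return round(ecpu_cost + storage_cost, 2)

# An idle cache: zero requests, storage floored at the 100 MB minimum
idle = monthly_cost(0, 0, 0)   # lands around $6/month
```

Run the same function against your measured request rate and payload sizes before deciding between serverless and provisioned.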
When to Use Serverless vs. Provisioned
Choose serverless when:
- Your workload is unpredictable, spiky, or highly variable
- You want to minimize operational overhead and skip capacity planning
- You're early-stage and don't know steady-state traffic patterns yet
- Your team doesn't have deep ElastiCache operational experience
- Time to production matters more than cost optimization
Choose node-based (provisioned) when:
- Your workload is predictable and reserved instance pricing (up to 48% savings) applies
- You need fine-grained control over engine parameters
- You need data tiering (r6gd instances with SSD)
- Cost optimization at scale is the priority
- You require Memcached
- You need specific node placement across AZs
My take: start with serverless for new workloads. Monitor ECPU consumption and storage for a few weeks. If monthly spend passes $800-1,200, run the numbers on provisioned clusters with 1-year reserved instances. The crossover point is real, and I've hit it on multiple projects.
Data Tiering
Data tiering on cache.r6gd node types extends ElastiCache storage beyond DRAM by adding locally attached NVMe SSDs. For large datasets, it's one of the most impactful cost optimizations available.
Memory + SSD Architecture
The r6gd data tiering architecture is straightforward:
- Keys always stay in DRAM. Only values are candidates for tiering to SSD.
- Hot data stays in DRAM. ElastiCache tracks every item's last access time and keeps frequently accessed data in memory.
- Cold data migrates to SSD. When DRAM fills up, an LRU algorithm moves infrequently accessed values to local NVMe SSD storage.
- SSD reads are transparent. When an application requests a value sitting on SSD, ElastiCache reads it, moves it back to DRAM asynchronously, and returns it to the client.
Added latency for SSD-resident data averages about 300 microseconds (assuming 500-byte string values). That's higher than sub-100-microsecond DRAM latency, sure. Still acceptable for most applications by a wide margin.
Capacity and Cost
| Node Type | DRAM | SSD | Total Capacity | Approx. On-Demand (us-east-1) |
|---|---|---|---|---|
| cache.r6gd.xlarge | 26.3 GiB | 118 GiB | ~144 GiB | ~$0.48/hr |
| cache.r6gd.2xlarge | 52.8 GiB | 237 GiB | ~290 GiB | ~$0.96/hr |
| cache.r6gd.4xlarge | 105.8 GiB | 474 GiB | ~580 GiB | ~$1.92/hr |
| cache.r6gd.8xlarge | 211.1 GiB | 949 GiB | ~1,160 GiB | ~$3.84/hr |
| cache.r6gd.16xlarge | 419.7 GiB | 1,897 GiB | ~2,317 GiB | ~$7.67/hr |
R6gd nodes offer 4.8x more total capacity (memory + SSD) than equivalent r6g nodes, with over 60% cost savings at maximum utilization. Build a 500-node cluster mode enabled configuration from the largest r6gd nodes and you get roughly 1 petabyte of total capacity (about 500 TB of unique data with one read replica per shard).
Security on Data Tiering Nodes
All Graviton2-based nodes include always-on 256-bit encrypted DRAM. Items stored on NVMe SSDs get encrypted by default using XTS-AES-256 block cipher in a hardware module on the node. This encryption is active even if you didn't explicitly enable encryption at rest. It's a hardware-level feature of the r6gd platform.
When Data Tiering Makes Sense
Data tiering works well when your dataset is large (hundreds of GiB to TBs), only 10-20% of the data gets regular access, and you can tolerate the extra ~300 microsecond latency on cold reads.
Skip it when every read needs consistent sub-millisecond latency, when your access pattern is uniformly random with no hot/cold separation, or when your dataset fits in memory on standard nodes without breaking a sweat.
Backup and Restore
RDB Snapshots
ElastiCache for Valkey/Redis supports point-in-time snapshots using the RDB (Redis Database) format. The engine forks the process: child writes the entire dataset to a binary file while the parent keeps serving requests.
| Feature | Detail |
|---|---|
| Automatic backups | Configurable retention period (0-35 days), daily during a specified backup window |
| Manual snapshots | On-demand, retained until explicitly deleted (no retention limit) |
| Snapshot storage | S3 (managed by ElastiCache, not visible in your S3 console) |
| Export to S3 | Manual snapshots can be exported to your own S3 bucket |
| Restore | Restore a snapshot to create a new cluster (cannot restore into an existing cluster) |
| Cluster mode enabled | Snapshots are taken per-shard in parallel |
| Cross-region | Export to S3 in one region, restore in another for DR or migration |
Memory overhead during snapshots: The fork uses copy-on-write semantics. Under heavy write load during a snapshot, the kernel duplicates modified memory pages, and memory usage can approach 2x the dataset size in the worst case. Run a node at 70% memory utilization, take a snapshot during a write-heavy period, and you risk hitting the memory ceiling. Evictions or OOM follow.
Keep baseline memory utilization below 50-60% on nodes taking automatic backups. Schedule backup windows during off-peak hours when write volume is lowest. I've seen teams learn this lesson only after a snapshot triggered evictions that cascaded into application errors. Not a fun page to get.
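You can turn that guidance into a pre-flight check. A sketch under the worst-case assumption that copy-on-write doubles the dataset's footprint during a write-heavy snapshot; the 5% engine reserve is my own fudge factor, not an AWS number.

```python
def snapshot_headroom_ok(node_memory_gib, dataset_gib, cow_factor=2.0, reserve_fraction=0.05):
    """Worst case, the fork's copy-on-write pages approach cow_factor x dataset size.
    Returns True if the node can absorb that plus a small engine reserve."""
    worst_case = dataset_gib * cow_factor
    usable = node_memory_gib * (1 - reserve_fraction)
    return worst_case <= usable

# A ~26 GiB node at 70% utilization risks OOM during a write-heavy snapshot...
risky = snapshot_headroom_ok(26.32, 26.32 * 0.70)   # False: ~36.8 GiB worst case
# ...but the same node at ~45% utilization has headroom
safe = snapshot_headroom_ok(26.32, 26.32 * 0.45)    # True
```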
AOF Persistence
AOF (Append-Only File) persistence logs every write operation to disk. On restart, the engine replays the AOF to reconstruct the dataset. With appendfsync everysec, you lose at most 1 second of writes.
The trade-offs are real, though. AOF increases write latency (each write persists to disk), uses additional storage, and replaying a large AOF during recovery takes time. One ElastiCache-specific caveat: AWS doesn't support AOF on Redis OSS 2.8.22 and later, steering you to Multi-AZ replication for durability instead, so treat AOF as a self-managed-Redis option. For pure caching workloads, RDB snapshots alone provide sufficient durability. AOF-style durability matters most when the cache stores data that's expensive to reconstruct: session state, computed aggregations, rate limiting counters. Losing even a few seconds of that data hurts.
Snapshot Export and Cross-Region Restore
ElastiCache lets you export manual snapshots to your own S3 bucket, which unlocks some useful patterns:
- Cross-region disaster recovery: Snapshot in one region, replicate the S3 object to another, restore there
- Environment seeding: Snapshot production, restore into staging with real data
- Migration: Snapshot from an old cluster, restore into a new one with different node types or cluster mode
- Long-term archival: Export snapshots to S3 for retention beyond the 35-day automatic backup limit
Security
Encryption In-Transit (TLS)
Enabling in-transit encryption activates TLS between clients and cache nodes, and between primary and replica nodes within the cluster. Sounds simple. A few things to know.
| Aspect | Detail |
|---|---|
| Protocol | TLS 1.2 / TLS 1.3 |
| Performance impact | ~25% throughput reduction on smaller node types; less on larger Graviton nodes with hardware crypto |
| Client requirement | Client must connect using TLS (redis:// becomes rediss://) |
| Configuration | Creation-time only on older engine versions; Redis OSS 7.x and Valkey support enabling TLS on an existing cluster via an online migration |
TLS overhead is real. On smaller node types (cache.t3, cache.m6g.medium), TLS handshakes and encryption eat noticeable CPU. Larger Graviton-based nodes have hardware-accelerated cryptographic engines that absorb most of the cost. Benchmark with TLS enabled before you finalize your node type. I've seen teams pick a node size based on non-TLS benchmarks, enable TLS in production, and immediately hit CPU saturation. Unpleasant surprise.
Encryption At-Rest
At-rest encryption uses AWS KMS to encrypt data on disk: RDB snapshots, AOF files, swap files. You pick the default AWS-managed key or a customer-managed CMK. Performance impact is negligible since hardware handles the encryption. Enable it for every production cluster. No reason not to.
Authentication Methods
| Method | Engine Versions | Mechanism | Best For |
|---|---|---|---|
| Redis/Valkey AUTH | All | Static token (16-128 characters) sent with every connection | Simple setups, legacy applications |
| RBAC (Role-Based Access Control) | Redis OSS 6.x+, Valkey 7.2+ | User groups with per-command and per-key access control | Multi-tenant clusters, principle of least privilege |
| IAM authentication | Redis OSS 7.0+, Valkey 7.2+ | Short-lived IAM tokens (15-minute validity) | AWS-native applications, centralized access management |
IAM authentication is what I recommend for any new deployment on Valkey 7.2+ or Redis OSS 7.0+. No AUTH tokens to manage or rotate. It plugs into your existing IAM policies and roles, and CloudTrail gives you audit trails. One catch: the 15-minute token validity means your client library needs automatic token refresh. Most modern Redis clients handle this natively.
AUTH token rotation: If you're using AUTH tokens, ElastiCache supports two rotation strategies. ROTATE adds a new token while keeping the old one valid, so you can do rolling updates across application instances without connection failures. SET immediately replaces the old token. Always use ROTATE in production. I once watched a team use SET during a token rotation. Instantly disconnected every application instance. The incident call was... educational.
VPC Placement and Security Groups
ElastiCache clusters must reside in a VPC. Keep security group rules tight:
- Allow inbound traffic on port 6379 (Valkey/Redis) or 11211 (Memcached) only from security groups of application instances that need cache access
- Don't open cache ports to 0.0.0.0/0 or broad CIDR ranges
- Use separate security groups for different environments even if they share a VPC
- Remember: encryption at rest must be enabled at cluster creation, and on older engine versions in-transit encryption and authentication were creation-only as well. Decide your security posture upfront. This one bites people.
Connection Management
With ElastiCache, connection management causes more production incidents than people expect. Getting it right means understanding the engine's connection model and your application's connection lifecycle.
Connection Limits Per Node Type
Each ElastiCache node has a maximum simultaneous connection count controlled by maxclients. Default is 65,000 for most node types, but the effective limit depends on available memory since each connection consumes memory for input/output buffers.
| Node Family | Default maxclients | Practical Limit | Notes |
|---|---|---|---|
| cache.t3.micro | 65,000 | Hundreds | Memory severely limits effective connections |
| cache.t3.medium | 65,000 | Low thousands | Reasonable for dev/test |
| cache.m6g.large | 65,000 | Thousands | Comfortable for moderate production workloads |
| cache.r6g.xlarge | 65,000 | Tens of thousands | Per-connection memory overhead becomes negligible |
| cache.r6g.4xlarge+ | 65,000 | Tens of thousands | Rarely connection-limited at this tier |
Connection Pooling Strategies
Every application instance needs a connection pool. Creating connections per operation is wasteful, and the math on this is unforgiving. TCP handshake alone costs 1-3 ms. Add TLS and it's 3-5 ms. A Redis GET takes 0.1-0.3 ms. That's connection overhead running 10-50x the cost of the actual command.
Pool sizing guidance:
| Factor | Recommendation |
|---|---|
| Connections per application instance | Start with 10-20, tune based on throughput needs |
| Total connections across fleet | Monitor CurrConnections; stay below 80% of maxclients |
| Idle timeout (client-side) | Set slightly below server-side timeout parameter |
| Connection validation | PING on borrow or reconnect-on-error |
| Maximum wait time | 1-2 seconds; fail fast rather than queue indefinitely |
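To make the fail-fast guidance concrete, here's a minimal bounded-pool sketch in Python. `connect` and `validate` are hypothetical callables (validate would send PING); in a real application you'd use your client library's built-in pool, such as redis-py's `ConnectionPool`, rather than rolling your own:

```python
import queue

class CachePool:
    """Minimal bounded connection pool sketch: fail fast instead of queueing forever.

    `connect` and `validate` are hypothetical callables standing in for real
    client-library machinery; `validate` models PING-on-borrow."""

    def __init__(self, connect, validate, max_size=20, max_wait=2.0):
        self._connect = connect
        self._validate = validate
        self._max_wait = max_wait
        self._idle = queue.LifoQueue(maxsize=max_size)
        for _ in range(max_size):
            self._idle.put(None)           # empty slots; connections created lazily

    def borrow(self):
        try:
            conn = self._idle.get(timeout=self._max_wait)
        except queue.Empty:
            raise TimeoutError("pool exhausted; failing fast after max_wait")
        if conn is None or not self._validate(conn):
            conn = self._connect()         # create, or replace a broken connection
        return conn

    def give_back(self, conn):
        self._idle.put(conn)
```

The 1-2 second `max_wait` is the important part: when the pool is exhausted, you want an immediate, visible error rather than requests silently queueing behind a saturated cache.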
Persistent Connections and DNS Caching
DNS caching causes the most production incidents during failovers. Here's how it plays out: failover happens, ElastiCache updates the primary endpoint's DNS record (5-second TTL) to point to the promoted replica. Your application caches DNS beyond that TTL, so it keeps sending writes to the old, failed node. Writes silently fail. Or worse, they hang.
Critical mitigations:
- JVM applications: Set `networkaddress.cache.ttl=5` in `java.security` or via `java.security.Security.setProperty()`. The JVM default caches DNS forever. Yes, forever. I'm not exaggerating.
- Linux systems: Configure nscd with low TTL or disable it entirely.
- Node.js: Default DNS cache honors TTL, but verify your HTTP/connection library does too.
- Python: The `socket` module doesn't cache DNS by default, but connection pools hold stale connections.
- All platforms: Test failover regularly using the `TestFailover` API. Do this before your first real failure. Not during it.
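On the Python side, staying current is a matter of re-resolving the endpoint at reconnect time rather than holding onto an IP. A small standard-library sketch:

```python
import socket

def resolve_fresh(host, port):
    """Re-resolve the endpoint on every reconnect instead of trusting a held IP.

    Python's socket module does not cache DNS, so calling getaddrinfo at
    reconnect time picks up ElastiCache's updated record (5-second TTL)."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    return [info[4][0] for info in infos]   # candidate IPs, freshest DNS view
```

Call this in your reconnect path (most Redis clients do the equivalent internally); the stale-connection risk lives in pools that pin an address at startup, not in the resolver itself.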
Client Library Selection
For Valkey/Redis, pick a cluster-aware client library. Here's what matters:
| Capability | Why It Matters |
|---|---|
| Cluster mode (MOVED/ASK redirections) | Required for cluster mode enabled topologies |
| Automatic reconnection and retry | Handles transient failures and failovers gracefully |
| Connection pooling | Eliminates connection churn overhead |
| TLS support | Required when encryption in-transit is enabled |
| IAM token refresh | Required for IAM authentication (15-minute token expiry) |
| Pipelining | Batches commands for throughput optimization |
Libraries I've had good results with: Lettuce (Java; non-blocking, cluster-aware, connection pooling built in), redis-py (Python), ioredis (Node.js), go-redis (Go).
Eviction Policies
When ElastiCache hits its memory limit (maxmemory), the eviction policy decides what happens next.
| Policy | Eviction Target | Algorithm | Best For |
|---|---|---|---|
| noeviction | None; rejects writes when full | N/A | Session stores, rate limiters: data must not be silently dropped |
| allkeys-lru | Any key | Least Recently Used | General-purpose caching: the most common and safest default |
| volatile-lru | Keys with TTL set | Least Recently Used | Mixed workloads: persistent keys coexist with expiring cache entries |
| allkeys-lfu | Any key | Least Frequently Used | Workloads with clear hot/cold patterns: preserves popular keys |
| volatile-lfu | Keys with TTL set | Least Frequently Used | Mixed workloads with frequency-based eviction preference |
| volatile-ttl | Keys with shortest remaining TTL | TTL-based | When TTL should drive eviction priority |
| allkeys-random | Any key | Random | Uniform access patterns (rare in practice) |
| volatile-random | Keys with TTL set | Random | Rarely appropriate in production |
When to Use Each
For pure caching (cache-aside pattern): allkeys-lru or allkeys-lfu. LFU is generally the better pick because it keeps frequently accessed keys even if their last access was a few seconds ago, but it requires Valkey/Redis 4.0+. LFU with the default frequency counter decay handles most workloads well.
For mixed use (cache + persistent data on the same cluster): volatile-lru or volatile-lfu. Set TTL on all cache keys. Don't set TTL on persistent data. The engine only evicts keys with a TTL, so your persistent data stays safe.
For session stores: noeviction. Silently evicting sessions means users get logged out with no explanation. Terrible user experience. Monitor DatabaseMemoryUsagePercentage aggressively and scale before memory fills.
For rate limiting: noeviction. If rate limit counters get evicted, rate limiting stops working. Your backend is now exposed to traffic storms.
For time-series data with natural expiration: volatile-ttl. Data with shorter remaining TTL (the oldest data) gets evicted first, preserving the most recent records.
Observability
Critical CloudWatch Metrics
ElastiCache publishes host-level and engine-level metrics to CloudWatch at 60-second intervals. Here are the ones I watch on every production deployment:
| Metric | What It Measures | Alarm Threshold | Why It Matters |
|---|---|---|---|
| EngineCPUUtilization | CPU consumed by the Valkey/Redis engine thread | > 90% | The true engine saturation indicator for single-threaded execution |
| CPUUtilization | Total host CPU (includes OS, I/O threads, background tasks) | > 90% | Includes all processes, useful for overall node health |
| DatabaseMemoryUsagePercentage | Data memory as percentage of maxmemory | > 80% | Primary memory pressure indicator; above 80% is danger zone |
| CacheHitRate | CacheHits / (CacheHits + CacheMisses) | < 80% (workload dependent) | Low hit rate means cache is ineffective; review TTLs, key design, or size |
| Evictions | Keys evicted per period due to memory pressure | > 0 (sustained) | Sustained evictions indicate memory pressure and potential data loss |
| CurrConnections | Current client connections | > 80% of maxclients | Approaching exhaustion causes connection refused errors |
| NewConnections | New connections per second | Abnormal spikes | Spikes indicate connection churn, likely from a pooling misconfiguration |
| ReplicationLag | Seconds replica is behind primary | > 5 seconds | High lag means data loss risk on failover and stale reads |
| SwapUsage | Swap space used in bytes | > 0 | Any swap means memory pressure; cache nodes should never swap |
| NetworkBytesIn/Out | Network throughput | Approaching node type limit | Network saturation causes dropped connections and latency |
| SaveInProgress | Whether a snapshot is currently being taken | N/A (informational) | Snapshots cause memory spikes and potential latency increases |
Here's the nuance that catches teams: EngineCPUUtilization and CPUUtilization tell you different things. A 4-core node might show 25% total CPUUtilization while EngineCPUUtilization sits at 90%. Engine thread is fully saturated. Three cores are idle. If you only alarm on CPUUtilization, you miss this completely. I've diagnosed production slowdowns where the on-call engineer kept saying "CPU looks fine" because they were looking at the wrong metric. Happens more than you'd think.
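The arithmetic behind that averaging effect, as a tiny illustrative helper:

```python
def host_cpu_from_engine(engine_cpu_pct, cores):
    """Worst-case host CPUUtilization when only the engine thread is busy:
    one saturated core averaged across all cores (ignores OS/background load)."""
    return engine_cpu_pct / cores

# A 4-core node with the engine thread at 90% averages out to 22.5% host CPU.
```

Add a little OS and I/O-thread overhead and you land near the 25% figure above: a node that looks three-quarters idle while commands queue behind a saturated engine thread.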
Slow Log
The Valkey/Redis slow log captures commands exceeding a configurable execution time threshold (set via slowlog-log-slower-than, default 10,000 microseconds / 10ms). Access it with SLOWLOG GET [count].
Check it regularly. It'll help you find:
- HGETALL or SMEMBERS on large collections (thousands of fields/members)
- SUNION, SINTER, SDIFF on large sets
- KEYS commands (which scan the entire keyspace; never use in production, use SCAN instead)
- Lua scripts with unexpectedly high execution time
- SORT commands on large lists
- DEL on large keys (use UNLINK instead for background deletion)
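SLOWLOG GET returns nested arrays, so it's worth scripting the triage. An illustrative parser (field order per the engine docs: entry id, unix timestamp, duration in microseconds, command array; newer engine versions append client address and name, ignored here):

```python
def flag_slow_commands(slowlog_entries, threshold_us=5000):
    """Filter SLOWLOG GET entries above a microsecond threshold.

    Each entry is [id, unix_ts, duration_us, [cmd, *args], ...]; trailing
    client fields exist on newer engine versions and are ignored here."""
    flagged = []
    for entry in slowlog_entries:
        entry_id, _ts, duration_us, command = entry[0], entry[1], entry[2], entry[3]
        if duration_us >= threshold_us:
            flagged.append((entry_id, duration_us, " ".join(command)))
    return flagged
```

Feed this the response from your client's `slowlog_get()` call on a schedule and you get a running list of the commands worth optimizing, sorted by whatever threshold matches your latency budget.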
Alarms I Always Configure
Every production ElastiCache cluster I run gets these alarms at minimum:
- DatabaseMemoryUsagePercentage > 75% (Warning). Investigate growth trend and plan scaling.
- DatabaseMemoryUsagePercentage > 85% (Critical). Scale immediately.
- EngineCPUUtilization > 80% (Warning). Review slow log, optimize expensive commands, or scale.
- EngineCPUUtilization > 90% (Critical). Engine thread is saturated. Commands are queuing.
- Evictions > 100 per 5 minutes (Warning). Memory pressure is causing data loss.
- ReplicationLag > 5 seconds (Critical). Failover would lose multiple seconds of writes.
- CurrConnections > 50,000 (Warning). Approaching the 65,000 default limit.
- SwapUsage > 0 (Critical). The node is swapping. Performance is destroyed.
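I keep alarm definitions in code rather than clicking them together. A sketch that builds the parameter dict for CloudWatch's `put_metric_alarm`; the alarm-naming convention and the 5-minute evaluation window here are my own choices, not AWS requirements:

```python
def memory_alarm_params(cluster_id, threshold_pct, severity):
    """Parameter dict for cloudwatch.put_metric_alarm(**params).

    Naming scheme and evaluation window are illustrative choices; the
    namespace, metric, and dimension names are the real CloudWatch ones."""
    return {
        "AlarmName": f"{cluster_id}-memory-{severity}",
        "Namespace": "AWS/ElastiCache",
        "MetricName": "DatabaseMemoryUsagePercentage",
        "Dimensions": [{"Name": "CacheClusterId", "Value": cluster_id}],
        "Statistic": "Average",
        "Period": 60,                       # ElastiCache publishes at 60s intervals
        "EvaluationPeriods": 5,             # sustained for 5 minutes before firing
        "Threshold": threshold_pct,
        "ComparisonOperator": "GreaterThanThreshold",
    }
```

Generating both the 75% warning and 85% critical alarms from one function keeps thresholds consistent across every cluster you run.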
Cost Analysis
On-Demand vs. Reserved Pricing
ElastiCache node-based pricing follows the standard AWS model. Reserved instances save real money for predictable workloads, and you should be using them.
| Node Type | vCPUs | Memory | Approx. On-Demand (us-east-1) | 1-yr Reserved (All Upfront) | Savings |
|---|---|---|---|---|---|
| cache.t3.medium | 2 | 3.09 GiB | ~$0.068/hr | ~$0.044/hr | ~35% |
| cache.m6g.large | 2 | 6.38 GiB | ~$0.137/hr | ~$0.087/hr | ~37% |
| cache.r6g.large | 2 | 13.07 GiB | ~$0.166/hr | ~$0.106/hr | ~36% |
| cache.r6g.xlarge | 4 | 26.32 GiB | ~$0.332/hr | ~$0.212/hr | ~36% |
| cache.r6g.4xlarge | 16 | 105.81 GiB | ~$1.326/hr | ~$0.847/hr | ~36% |
| cache.r7g.large | 2 | 13.07 GiB | ~$0.174/hr | ~$0.111/hr | ~36% |
| cache.r6gd.xlarge | 4 | 26.3 GiB + 118 GiB SSD | ~$0.480/hr | ~$0.307/hr | ~36% |
| cache.r6g.16xlarge | 64 | 419.7 GiB | ~$6.567/hr | ~$4.203/hr | ~36% |
Reserved node options:
| Term | Payment Option | Approximate Savings vs. On-Demand |
|---|---|---|
| 1-year | No Upfront | ~25% |
| 1-year | Partial Upfront | ~30% |
| 1-year | All Upfront | ~36% |
| 3-year | All Upfront | ~48% |
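The savings compound quickly at scale. A quick arithmetic check using the table's approximate rates:

```python
HOURS_PER_YEAR = 8760

def annual_savings(on_demand_hourly, reserved_hourly):
    """Yearly savings from reserving one node, at the table's approximate rates."""
    return (on_demand_hourly - reserved_hourly) * HOURS_PER_YEAR

# One cache.r6g.xlarge: ~$0.332/hr on-demand vs ~$0.212/hr 1-yr all-upfront
# works out to roughly $1,051 saved per node per year.
```

Multiply by a production replication group of, say, 3 shards with 2 replicas each (9 nodes) and a single reservation decision is worth close to $9,500 a year at these rates.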
Serverless Pricing Comparison
Serverless rates (us-east-1), both engines:
| Dimension | Valkey | Redis OSS |
|---|---|---|
| ECPUs | ~$0.0023 per million | ~$0.0034 per million |
| Data storage | ~$0.084 per GB-hour | ~$0.125 per GB-hour |
| Minimum storage | 100 MB (~$6/month idle) | 1 GB (~$90/month idle) |
Cost crossover analysis: Where exactly provisioned beats serverless depends on your access patterns. In my experience running both, the crossover falls around $800-1,200/month of serverless spend for Valkey workloads. Below that, serverless wins on operational simplicity alone. Above that, provisioned with 1-year reserved instances typically costs less.
Cost Optimization Strategies
- Use reserved instances for stable production workloads. Even 1-year no-upfront reservations save 25%.
- Right-size node types. If DatabaseMemoryUsagePercentage stays below 50%, you're over-provisioned. Shrink the nodes.
- Use Graviton-based instances (r7g, m7g). 20-28% better price-performance than Intel. There's no reason to run Intel for ElastiCache anymore.
- Use data tiering for large datasets with hot/cold access patterns. 60%+ savings versus memory-only nodes add up fast.
- Choose Valkey over Redis OSS. 20% lower node pricing, 33% lower serverless pricing, identical functionality.
- Scale replicas based on actual read throughput. Don't default to 5 replicas per shard. Each replica is a full node on your bill.
- Set TTLs on every cache key. Keys without TTLs grow indefinitely, silently inflating memory and cost.
- Use cluster mode enabled for write scaling rather than upgrading to larger (and pricier) node types.
Common Failure Modes
Memory Pressure and OOM
The failure mode I see most often. When the dataset exceeds maxmemory, the engine either evicts keys (if eviction is configured) or rejects all writes with OOM errors (if noeviction). Extreme cases? The OS OOM killer terminates the process entirely.
Prevention:
- Monitor `DatabaseMemoryUsagePercentage` and alarm at 75% (warning) and 85% (critical)
- Account for snapshot memory overhead; forks temporarily double memory usage
- Account for memory fragmentation; actual OS memory runs 10-20% higher than logical dataset size
- Enable `activedefrag yes` to reduce fragmentation over time
- Enable `lazyfree-lazy-eviction yes` to reduce latency spikes during eviction
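Those overhead factors translate into a sizing rule of thumb. This is my own illustrative formula derived from the bullets above, not an official AWS one:

```python
def min_node_memory_gib(dataset_gib, fragmentation=0.20, snapshot_fork=True):
    """Rule-of-thumb node sizing: pad the logical dataset for fragmentation,
    then leave fork headroom for snapshots. Illustrative, not an AWS formula."""
    needed = dataset_gib * (1 + fragmentation)
    if snapshot_fork:
        needed *= 2        # a forked snapshot child can temporarily double usage
    return needed
```

A 10 GiB dataset that takes snapshots wants roughly 24 GiB of node memory by this math, which is why teams sizing nodes to their dataset alone get OOM-killed during the nightly backup.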
Failover Detection Delay
After a primary node fails, applications keep sending requests to the dead node for 30-60 seconds while ElastiCache detects the failure and promotes a replica. Writes fail during this window. Users notice.
Mitigation:
- Set client-side connection timeouts to 1-2 seconds
- Implement retry logic with exponential backoff and jitter
- Design your application to degrade gracefully when the cache is temporarily gone
- Use circuit breakers to prevent cascading failures
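The retry guidance above, in code: a full-jitter exponential backoff wrapper (the jitter keeps a fleet of retrying clients from synchronizing their retries into a second incident). `op` is any callable that raises on transient failure:

```python
import random
import time

def with_retries(op, attempts=4, base=0.05, cap=2.0, sleep=time.sleep):
    """Retry a cache operation with full-jitter exponential backoff.

    Sleeps a random amount between 0 and min(cap, base * 2**attempt);
    re-raises after the final attempt so failures stay visible."""
    for attempt in range(attempts):
        try:
            return op()
        except ConnectionError:
            if attempt == attempts - 1:
                raise
            sleep(random.random() * min(cap, base * (2 ** attempt)))
```

The 2-second cap matters: during a 30-60 second failover window you want clients retrying briefly and then falling back to degraded behavior, not sleeping their way through the user's patience.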
DNS Caching During Failover
This one catches more teams off guard than any other failure mode. After failover, ElastiCache updates the primary endpoint's DNS record (5-second TTL) to point to the promoted replica. If your application caches DNS beyond that TTL, it keeps talking to the old node. Your app thinks the cache is down. In reality it's just talking to the wrong server.
Mitigation:
- JVM: Set `networkaddress.cache.ttl=5` (the default is infinite, which is astonishing)
- Linux: Configure nscd with low TTL or disable it
- Cluster mode enabled: Use cluster-aware clients that respond to MOVED redirections
- Test failover regularly with the `TestFailover` API. Test it now, before you need it.
Large Keys Blocking Operations
Valkey/Redis uses single-threaded command execution. One command operating on a massive data structure blocks everything else until it's done. HGETALL on a 10 MB hash. DEL on a 100 MB sorted set. KEYS on a million-key keyspace. Any of these blocks the engine thread for tens or hundreds of milliseconds, causing timeouts across your entire application fleet.
Mitigation:
- Use `UNLINK` instead of `DEL` for large keys. UNLINK reclaims memory in a background thread.
- Use `HSCAN`, `SSCAN`, `ZSCAN` instead of `HGETALL`, `SMEMBERS`, `ZRANGEBYLEX` on large collections
- Never use `KEYS` in production. Use `SCAN` with the COUNT parameter.
- Enable lazy free parameters: `lazyfree-lazy-eviction`, `lazyfree-lazy-expire`, `lazyfree-lazy-server-del`
- Monitor the slow log for commands exceeding 5 ms
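The SCAN calling pattern, modeled in plain Python. Real SCAN cursors are opaque server-side positions, not list offsets; this models only the bounded-batch loop your client code should follow (redis-py's `scan_iter` wraps the same loop for you):

```python
def scan_keys(keys, cursor=0, count=100):
    """SCAN-shaped step: return (next_cursor, batch) so callers walk the
    keyspace in bounded chunks instead of one blocking KEYS call.
    Simplified model: real cursors are not simple offsets."""
    batch = keys[cursor:cursor + count]
    next_cursor = cursor + count if cursor + count < len(keys) else 0
    return next_cursor, batch

def all_keys_incrementally(keys, count=100):
    """Drive the cursor loop to completion; cursor 0 signals the end."""
    cursor, found = 0, []
    while True:
        cursor, batch = scan_keys(keys, cursor, count)
        found.extend(batch)
        if cursor == 0:
            return found
```

Each step holds the engine thread for one bounded batch, so other clients' commands interleave between batches instead of stalling behind a full keyspace scan.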
Lua Script Timeouts
Lua scripts execute atomically. No other commands run until the script completes, so a runaway script blocks the entire shard. The lua-time-limit parameter (default 5,000ms / 5 seconds) controls when the engine starts accepting SCRIPT KILL commands. It doesn't automatically kill the script, though. You have to do that yourself.
Mitigation:
- Keep Lua scripts short and bounded. No unbounded loops over large datasets.
- Test scripts against production-scale data volumes in staging
- Monitor `EngineCPUUtilization` for sustained 100% spikes that indicate a stuck script
- Use `EVALSHA` (cached compiled script) instead of `EVAL` (inline script) for frequently executed scripts
- Set a reasonable `lua-time-limit` and have runbooks for `SCRIPT KILL`
Cluster Mode Resharding Impact
During online resharding, the cluster migrates hash slots between shards. Keys in migrating slots experience higher latency, and multi-key operations spanning migrating and non-migrating slots fail with CROSSSLOT errors.
Mitigation:
- Schedule resharding during low-traffic windows
- Monitor latency metrics closely during the operation
- Use hash tags to co-locate related keys, so they migrate together atomically
- Avoid resharding during periods of high write throughput
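Hash tags work because cluster mode assigns every key to one of 16,384 slots via CRC16, hashing only the brace-delimited tag when one is present. A sketch of the slot computation:

```python
def crc16(data: bytes) -> int:
    """CRC16-CCITT (XModem), the checksum cluster mode uses for key slots."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021) & 0xFFFF if crc & 0x8000 else (crc << 1) & 0xFFFF
    return crc

def key_slot(key: str) -> int:
    """Hash slot for a key, honoring {hash tags}: only the first non-empty
    brace-delimited section is hashed, so tagged keys land on one shard."""
    start = key.find("{")
    if start != -1:
        end = key.find("}", start + 1)
        if end != -1 and end > start + 1:
            key = key[start + 1:end]
    return crc16(key.encode()) % 16384
```

Because `user:{42}:profile` and `user:{42}:cart` both hash only `42`, they share a slot, migrate together during resharding, and remain valid targets for multi-key operations.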
Caching Patterns
Your caching pattern determines how data flows between application, cache, and database. Each one brings different consistency, latency, and complexity trade-offs.
Cache-Aside (Lazy Loading)
Most common pattern and generally the safest. Application checks the cache first. On a miss, it reads from the database, writes the result to cache with a TTL, and returns it. I use this as my default for every new project until I've got a specific reason to do otherwise.
| Advantage | Disadvantage |
|---|---|
| Only requested data is cached (efficient memory use) | First request for each key incurs cache-miss latency (3 round trips) |
| Cache failure is non-fatal (application degrades to database reads) | Data can become stale until TTL expires or explicit invalidation |
| Simple to implement and reason about | Application code must handle miss/fill logic |
```mermaid
flowchart TD
    A[Application<br/>needs data] --> B{Check<br/>Cache}
    B -->|Hit| C[Return<br/>cached data]
    B -->|Miss| D[Read from<br/>Database]
    D --> E[Write to Cache<br/>with TTL]
    E --> F[Return data<br/>to caller]
    style C fill:#2d7,stroke:#333
    style D fill:#e94,stroke:#333
```
Write-Through
Every database write is immediately followed by a cache write. Once populated, reads always hit the cache.
| Advantage | Disadvantage |
|---|---|
| Cache is always consistent with the database after writes | Write latency increases (two writes per operation) |
| Eliminates stale data for recently written keys | Caches data that may never be read (wasted memory) |
| Simplifies the read path | Cache availability affects write path |
Write-Behind (Write-Back)
Application writes to the cache. The cache asynchronously writes to the database. The cache becomes your primary write target.
| Advantage | Disadvantage |
|---|---|
| Lowest write latency (cache write only) | Data loss risk if cache fails before async write |
| Can batch and coalesce database writes | Complex to implement reliably |
| Absorbs write spikes, protects the database | Debugging data inconsistencies is difficult |
Write-behind carries real risk. I only use it when the performance benefit justifies the complexity and occasional data loss is acceptable. Analytics counters and activity aggregations? Good candidates. Financial records? Absolutely not.
Read-Through
The cache itself loads data from the database on a miss. Application only talks to the cache. ElastiCache doesn't natively support read-through, so you'll need a caching library or application-side middleware to implement it.
TTL Strategies
| Strategy | TTL Range | Use Case |
|---|---|---|
| Short TTL | 5-60 seconds | Near-real-time data: stock prices, live scores, rapidly changing content |
| Medium TTL | 5-30 minutes | Semi-dynamic data: user profiles, product listings, search results |
| Long TTL | 1-24 hours | Slowly changing data: configuration, reference tables, CMS content |
| No TTL + explicit invalidation | Indefinite | Event-driven updates via pub/sub or application logic |
| Jittered TTL | Base +/- 10-20% random | Any pattern: prevents synchronized expiration across related keys |
Set a TTL on every cache key. Even with explicit invalidation, a TTL acts as a safety net for when invalidation fails or gets delayed. I default to 1 hour for most workloads and adjust per key type based on how much staleness the application can tolerate.
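Jitter is a one-liner worth standardizing across your codebase. The 15% default here is my own habit, not a canonical value:

```python
import random

def jittered_ttl(base_seconds, jitter=0.15, rng=random.uniform):
    """Base TTL +/- 15% random jitter so related keys don't expire in lockstep."""
    return round(base_seconds * (1 + rng(-jitter, jitter)))
```

Route every `SET ... EX` through this and a batch of keys written together expires spread across a window instead of at one synchronized instant.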
Cache Stampede Prevention
A cache stampede (thundering herd) happens when a popular key expires and hundreds of concurrent requests simultaneously miss the cache and slam the database. I've been on the receiving end of these. They cause real outages.
Prevention strategies:
| Strategy | Mechanism | Trade-off |
|---|---|---|
| Jittered TTLs | Add random offset (base TTL +/- 10-20%) | Simple but does not prevent stampede on individual hot keys |
| Lock-based recomputation | Acquire distributed lock (SET NX EX), recompute, release. Others wait or return stale value. | Prevents stampede but adds lock management complexity |
| Early recomputation | Background process refreshes cache before TTL expires | Key never actually expires under normal operation, but requires background infrastructure |
| Stale-while-revalidate | Return stale value immediately, refresh asynchronously | Best user experience but requires storing metadata alongside values |
For high-traffic keys, I combine jittered TTLs with lock-based recomputation. Lock stops the stampede. Jitter prevents synchronized expiration across related keys. Together they handle the vast majority of scenarios.
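A sketch of the lock half of that combination. On a real cluster the lock is a `SET NX EX` call against the cache itself; here a dict simulates the lock keyspace, and concurrent misses serve the stale value instead of piling onto the database:

```python
import time

def get_with_lock(cache, locks, key, recompute, ttl=300, lock_ttl=10,
                  now=time.monotonic):
    """Lock-based stampede protection sketch. The `locks` dict simulates
    SET NX EX: first miss acquires the lock and recomputes; concurrent
    misses return the stale value rather than hitting the database."""
    entry = cache.get(key)
    if entry is not None and now() < entry[1]:
        return entry[0]                         # fresh hit
    lock_key = f"lock:{key}"
    lock_expiry = locks.get(lock_key)
    if lock_expiry is None or now() >= lock_expiry:
        locks[lock_key] = now() + lock_ttl      # acquired (SET NX EX analogue)
        value = recompute(key)
        cache[key] = (value, now() + ttl)
        locks.pop(lock_key, None)               # release
        return value
    if entry is not None:
        return entry[0]                         # stale-but-available beats a stampede
    raise LookupError("cold key while another worker recomputes")
```

The `lock_ttl` is the safety valve: if the recomputing worker dies, the lock expires and another worker takes over instead of the key staying wedged.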
Key Architectural Patterns
After years running ElastiCache in production, these are the patterns and decisions that matter most:
- Choose Valkey for new deployments. API-compatible with Redis. Fully open source under the Linux Foundation. 20-33% cheaper. AWS is investing its development energy here.
- Start with cluster mode enabled. Migrating from disabled to enabled means creating a new cluster. Start with enabled (even single-shard) and you get horizontal scalability without a migration event later.
- Enable Multi-AZ with automatic failover for every production cluster. Cost: one replica per shard. Cost of skipping it: application degradation, database overload, customer impact during a cache outage. Not a close call.
- Set a TTL on every key. Cache keys without TTLs accumulate silently until they trigger evictions or OOM. A TTL bounds staleness and manages memory simultaneously.
- Monitor EngineCPUUtilization, not just CPUUtilization. For single-threaded Valkey/Redis, EngineCPUUtilization shows actual engine saturation. Overall CPUUtilization on a multi-core node hides a fully saturated engine thread behind averaged-out idle cores.
- Fix DNS caching before your first failover. DNS caching during failover is the most common cause of extended outages after an ElastiCache node failure. Test failover with the TestFailover API in staging. Verify your application follows DNS changes within seconds.
- Never use the KEYS command in production. Use SCAN with a COUNT parameter. KEYS blocks the engine thread for its entire execution, scanning every key. On busy clusters, this cascades into timeouts across every connected application.
- Implement connection pooling from day one. Connection churn wastes resources on both sides, adds latency, and exhausts the node's connection limit during traffic spikes.
- Use data tiering for large, infrequently-accessed datasets. The 60%+ cost savings on r6gd instances are real. ~300-microsecond SSD latency is acceptable for workloads with genuine hot/cold access patterns.
- Design for cache failure. Your application must degrade gracefully when the cache is unavailable. If a cache outage becomes an application outage, you've got a design problem.
Additional Resources
- Amazon ElastiCache Developer Guide: comprehensive reference for all ElastiCache features, configuration options, and API operations
- ElastiCache for Valkey Getting Started Guide: walkthrough for creating and connecting to a Valkey cache in both serverless and node-based modes
- Database Caching Strategies Using Redis: AWS whitepaper covering cache-aside, write-through, and other caching patterns with detailed implementation guidance
- Best Practices for Sizing Your ElastiCache Clusters: AWS-published guidance on node type selection, memory sizing, and connection management
- ElastiCache CloudWatch Metrics Reference: complete list of available metrics with descriptions and recommended alarm thresholds
- ElastiCache Data Tiering Documentation: architecture details, supported node types, and performance characteristics for r6gd data tiering
- ElastiCache Security Best Practices: encryption, authentication, RBAC, IAM integration, and network isolation guidance
- Monitoring Best Practices with ElastiCache Using CloudWatch: detailed guidance on metric selection, alarm configuration, and observability patterns
- Valkey Project Documentation: upstream engine documentation, command reference, and release notes at valkey.io
- Amazon ElastiCache Pricing: current pricing for on-demand, reserved, and serverless across all regions and node types
Let's Build Something!
I help teams ship cloud infrastructure that actually works at scale. Whether you're modernizing a legacy platform, designing a multi-region architecture from scratch, or figuring out how AI fits into your engineering workflow, I've seen your problem before. Let me help.
Currently taking on select consulting engagements through Vantalect.

