About the author: I'm Charles Sieg, a cloud architect and platform engineer who builds apps, services, and infrastructure for Fortune 1000 clients through Vantalect. If your organization is rethinking its software strategy in the age of AI-assisted engineering, let's talk.
EC2 is the single largest line item on most AWS bills. It is also the line item where the gap between what teams pay and what they should pay is the widest. I have audited AWS accounts where compute spending dropped 45% in the first month after applying the strategies in this article. No performance loss. No architectural changes. Just purchasing mechanics, instance selection, and scheduling discipline. The savings were always available. The team had never looked.
This is the second article in the AWS Cost Savings series (see AWS S3 Cost Optimization: The Complete Savings Playbook for S3). What follows covers five specific levers for reducing EC2 spend: commitment discounts, Spot Instances, right-sizing, Graviton migration, and scheduling. Each section includes the pricing math, break-even calculations, and the operational gotchas that catch teams who optimize without understanding the trade-offs.
The Five Levers of EC2 Cost Optimization
Before diving into each strategy, here is the landscape. Every EC2 cost reduction falls into one of five categories, and the order matters. Applying them in the wrong sequence wastes effort and leaves savings on the table.
The Optimization Sequence
Right-size first, then migrate to Graviton, then schedule non-production, then commit, then layer in Spot. Each step changes the baseline that the next step operates on. Committing to oversized instances locks in waste for one to three years. Scheduling a fleet before right-sizing it means you schedule instances that are too large. The sequence protects against compounding bad decisions.
```mermaid
flowchart LR
    A[Right-Size Instances] --> B[Migrate to Graviton]
    B --> C[Schedule Non-Production]
    C --> D[Commit via Savings Plans]
    D --> E[Layer in Spot Instances]
```
Savings by Strategy
| Strategy | Typical Savings | Risk Level | Effort | Applies To |
|---|---|---|---|---|
| Right-sizing | 20-40% | Low | Medium | All instances |
| Graviton migration | 20-40% | Low | Medium | Workloads with ARM support |
| Scheduling | 65-70% | None | Low | Dev, test, staging environments |
| Savings Plans | 30-72% | Medium (commitment) | Low | Steady-state production |
| Spot Instances | Up to 90% | High (interruption) | High | Stateless, fault-tolerant workloads |
The percentages compound. Right-sizing a fleet from m7i.2xlarge to m7i.xlarge saves ~50%. Migrating that right-sized fleet to m7g.xlarge saves another ~20%. Scheduling non-production instances to run 50 hours/week instead of 168 saves another ~70% on those instances. Committing to the remaining steady-state production saves 30-66%. Applied in sequence across a typical mixed fleet, the total reduction from the original On-Demand bill often exceeds 60%.
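The compounding can be sketched in a few lines. Rates are the article's illustrative us-east-1 Linux On-Demand figures; each lever multiplies the previous per-hour rate rather than adding percentages.

```python
# Sketch: how the levers compound. Rates are illustrative; verify current
# AWS pricing before relying on them.
ON_DEMAND = 0.4032  # m7i.2xlarge, $/hr
steps = [
    ("right-size to m7i.xlarge", 0.2016 / 0.4032),  # keep ~50% of the rate
    ("migrate to m7g.xlarge",    0.1632 / 0.2016),  # keep ~81%
    ("1-year Compute SP",        0.65),             # keep ~65% (~35% discount)
]

rate = ON_DEMAND
for name, retained in steps:
    rate *= retained
    print(f"{name}: ${rate:.4f}/hr, cumulative savings {1 - rate / ON_DEMAND:.0%}")
```

Three multiplicative steps of 50%, 19%, and 35% land near 74% total, which is why sequencing beats any single lever.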
Commitment Discounts: Reserved Instances vs. Savings Plans
AWS offers two commitment mechanisms: Reserved Instances (RIs) and Savings Plans. Both trade flexibility for discount, and the right choice depends on how predictable your fleet composition is.
Reserved Instances
RIs commit you to a specific instance type, operating system, tenancy, and region for one or three years. In return, AWS discounts the hourly rate by up to 72% for Standard RIs and up to 66% for Convertible RIs. Three payment options exist:
| Payment Option | Discount Depth | Cash Flow Impact |
|---|---|---|
| All Upfront | Highest (~72% for 3-year Standard) | Large upfront payment, no monthly charges |
| Partial Upfront | Medium (~65% for 3-year Standard) | Reduced upfront, reduced monthly |
| No Upfront | Lowest (~40% for 1-year Standard) | No upfront, discounted monthly rate |
Standard RIs lock you into the exact instance type. If you buy m5.xlarge RIs and later right-size to m5.large, the RI still applies (within the same family, size flexibility normalizes the discount). But if you switch families (m5 to m7i), the Standard RI becomes wasted spend. Convertible RIs let you exchange for different instance types, but the discount is shallower.
Savings Plans
Savings Plans commit you to a dollar-per-hour spend rate rather than a specific instance. Three types:
| Plan Type | Flexibility | Max Discount | Scope |
|---|---|---|---|
| Compute Savings Plan | Highest | Up to 66% | EC2, Fargate, Lambda; any region, family, size, OS |
| EC2 Instance Savings Plan | Medium | Up to 72% | EC2 only; specific family and region, any size and OS |
| SageMaker Savings Plan | ML-specific | Up to 64% | SageMaker instance usage only |
Compute Savings Plans are the default recommendation for most teams. The discount is slightly shallower than EC2 Instance Plans (66% vs. 72% at the 3-year All Upfront tier), but the flexibility to change instance families, regions, and even compute services (moving workloads from EC2 to Fargate) without losing the discount makes that trade worthwhile.
The Commitment Decision Framework
My standard approach for commitment purchases:
- Cover 70-80% of steady-state with Compute Savings Plans. This accounts for organic growth and workload shifts without overcommitting.
- Use 1-year terms unless you have high confidence in a 3-year forecast. The incremental savings from 3-year terms (roughly 15-20% more than 1-year) rarely justify the inflexibility for teams whose architectures evolve.
- Layer EC2 Instance Plans on top for workloads that genuinely will not change family or region for the term (database hosts, fixed infrastructure).
- Re-evaluate quarterly. Usage patterns drift. A commitment that covered 80% of your fleet in January may cover 60% by June if you have launched new services or migrated workloads.
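The first rule above can be reduced to arithmetic. A toy sketch, assuming you have hourly compute-spend samples (from Cost Explorer or CUR exports; the numbers here are invented): take the floor of observed spend as steady-state and commit to a fraction of it.

```python
# Hedged sketch: size a Savings Plan commitment at 75% of the steady-state
# floor. Spend samples are made up for illustration.
hourly_spend = [4.2, 4.0, 4.1, 6.5, 7.0, 4.3, 4.0, 5.8]  # $/hr

baseline = min(hourly_spend)             # spend you incur every hour, bursts excluded
commitment = round(baseline * 0.75, 2)   # cover 75% of the floor
print(f"commit to ${commitment}/hr")
```

Using the minimum (or a low percentile) rather than the mean keeps bursty hours from inflating the commitment.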
Cost Math Example
A fleet of 20 m7i.xlarge instances running 24/7 in us-east-1:
| Pricing Model | Hourly Rate | Monthly Cost (20 instances) | Annual Cost | Savings vs. On-Demand |
|---|---|---|---|---|
| On-Demand | ~$0.2016 | ~$2,903 | ~$34,841 | Baseline |
| 1-year Compute SP (No Upfront) | ~$0.1310 | ~$1,886 | ~$22,637 | 35% |
| 1-year EC2 Instance SP (All Upfront) | ~$0.1150 | ~$1,656 | ~$19,872 | 43% |
| 3-year Compute SP (All Upfront) | ~$0.0820 | ~$1,181 | ~$14,170 | 59% |
| 3-year EC2 Instance SP (All Upfront) | ~$0.0580 | ~$835 | ~$10,022 | 71% |
The spread between 1-year and 3-year is roughly $10,000/year for this 20-instance fleet. That is real money, but only if you are confident those 20 instances will still be m7i.xlarge in the same region three years from now.
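The table's figures can be reproduced from the hourly rates, assuming a 720-hour month; the small differences from the table are rounding.

```python
# Sketch reproducing the fleet cost table (illustrative rates from the article).
HOURS_PER_MONTH = 720  # assumption: 30-day month
FLEET = 20

on_demand = 0.2016
plans = {
    "1yr Compute SP (No Upfront)":       0.1310,
    "1yr EC2 Instance SP (All Upfront)": 0.1150,
    "3yr Compute SP (All Upfront)":      0.0820,
    "3yr EC2 Instance SP (All Upfront)": 0.0580,
}

for name, rate in plans.items():
    monthly = rate * HOURS_PER_MONTH * FLEET
    savings = 1 - rate / on_demand
    print(f"{name}: ${monthly:,.0f}/mo, {savings:.0%} vs On-Demand")
```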
Spot Instances: 90% Savings with Production Guardrails
Spot Instances use spare EC2 capacity at discounts up to 90% off On-Demand. AWS can reclaim them with two minutes of notice. That constraint makes them unsuitable for stateful workloads but exceptional for anything that can tolerate interruption.
Where Spot Works
| Workload Type | Spot Fit | Why |
|---|---|---|
| CI/CD build agents | Excellent | Short-lived, stateless, restartable |
| Batch processing (ETL, data pipelines) | Excellent | Checkpointable, parallelizable |
| Container tasks (ECS, EKS) | Good | Orchestrator handles rescheduling |
| Web/API servers behind ALB | Good with caveats | Need mixed instance groups and connection draining |
| Databases | Never | Stateful, interruption causes data loss |
| Single long-running jobs | Poor | Interruption restarts the entire job |
Interruption Rates and Allocation Strategy
AWS publishes interruption frequency data through the Spot Instance Advisor. Less than 5% of Spot Instances get reclaimed by AWS before the customer terminates them voluntarily. The trick is diversification. Requesting a single instance type in a single AZ concentrates your interruption risk. Spreading across multiple instance types, sizes, and AZs through a Spot Fleet or EC2 Auto Scaling group with mixed instance policies drops the effective interruption rate dramatically.
Use the price-capacity-optimized allocation strategy. This selects instances from the pools with the most available capacity at the lowest price. Older strategies (lowest-price, capacity-optimized) optimize for one dimension. The combined strategy handles both.
Interruption Handling
The two-minute warning arrives as an EC2 metadata event and an EventBridge notification. Use both:
- Metadata polling: A daemon on the instance checks `http://169.254.169.254/latest/meta-data/spot/instance-action` every 5 seconds. When the response changes from 404 to a JSON payload, the instance has two minutes.
- EventBridge rule: Triggers a Lambda that deregisters the instance from the target group, drains connections, and pushes state to S3 or DynamoDB.
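The polling side is small enough to sketch. A real daemon would GET the metadata URL (with an IMDSv2 token) on a timer; the part worth showing is interpreting the two possible responses, which is all this hedged sketch does.

```python
import json
from datetime import datetime, timezone

def parse_instance_action(status_code: int, body: str):
    """Return (action, deadline) if an interruption is pending, else None.

    404 means no interruption is scheduled. Otherwise the body is a JSON
    payload like {"action": "terminate", "time": "2025-01-01T12:00:00Z"}.
    """
    if status_code == 404:
        return None
    payload = json.loads(body)
    deadline = datetime.strptime(
        payload["time"], "%Y-%m-%dT%H:%M:%SZ"
    ).replace(tzinfo=timezone.utc)
    return payload["action"], deadline

# Normal case: nothing pending.
print(parse_instance_action(404, ""))
```

When a payload appears, the daemon has roughly two minutes until `deadline` to drain connections and checkpoint state.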
I have run Spot fleets for CI/CD pipelines that executed 50,000+ builds per month. Total interruption-caused failures over six months: 12. Every one recovered automatically because the build system (Jenkins, in that case) retried on a fresh instance. The cost was 73% below On-Demand. For batch workloads that checkpoint to S3 every 10 minutes, interruption is a non-event: the job resumes from the last checkpoint on a new instance.
Spot and Auto Scaling
Configure Auto Scaling groups with a mixed instances policy. Set a base capacity of On-Demand instances (enough to handle minimum traffic) and fill the rest with Spot. The capacity-rebalancing feature automatically launches replacement Spot Instances when the rebalance recommendation signal fires, before the actual interruption happens. This keeps your fleet stable through capacity shifts.
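A minimal sketch of what that policy looks like, in the shape boto3's `autoscaling.create_auto_scaling_group` accepts (verify field names against current boto3 docs; the launch template name is a placeholder):

```python
# Hedged sketch: MixedInstancesPolicy with an On-Demand base and Spot fill.
# "web-fleet" is a hypothetical launch template name.
mixed_instances_policy = {
    "LaunchTemplate": {
        "LaunchTemplateSpecification": {
            "LaunchTemplateName": "web-fleet",
            "Version": "$Latest",
        },
        # Diversify across types and sizes to lower effective interruption rate.
        "Overrides": [
            {"InstanceType": t}
            for t in ["m7g.large", "m7g.xlarge", "m6g.large",
                      "m6g.xlarge", "c7g.xlarge"]
        ],
    },
    "InstancesDistribution": {
        "OnDemandBaseCapacity": 2,                 # always-on floor
        "OnDemandPercentageAboveBaseCapacity": 0,  # fill the rest with Spot
        "SpotAllocationStrategy": "price-capacity-optimized",
    },
}
print(len(mixed_instances_policy["LaunchTemplate"]["Overrides"]), "Spot pools per AZ")
```

Five override types across each AZ gives the allocation strategy many pools to choose from, which is what drops the effective interruption rate.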
Right-Sizing with Compute Optimizer
Right-sizing is the most impactful optimization most teams skip. Over-provisioned instances are everywhere. Teams launch an m5.2xlarge "to be safe" during development and never revisit it. That instance runs at 8% average CPU for two years.
How Compute Optimizer Works
AWS Compute Optimizer analyzes 14 days of CloudWatch metrics (CPU utilization, memory utilization via CloudWatch Agent, network I/O, disk I/O) and cross-references them against the performance characteristics of 140+ instance types. It produces three recommendation tiers:
| Recommendation | Meaning | Typical Action |
|---|---|---|
| Over-provisioned | Instance resources exceed workload needs | Downsize to recommended type |
| Under-provisioned | Workload exceeds instance resources | Upsize or change family |
| Optimized | Instance matches workload | No action needed |
In my experience, 40-60% of instances in a typical AWS account are over-provisioned, and the median right-sizing recommendation saves ~35% per instance. For a 100-instance fleet averaging $200/month per instance, resizing the over-provisioned half at that median saving translates to roughly $42,000/year from right-sizing alone.
The Right-Sizing Process
- Enable Compute Optimizer across all accounts in your AWS Organization. Free for basic recommendations. Enhanced recommendations (3 months of metric history, memory metrics) cost $0.0003360 per resource per hour.
- Install the CloudWatch Agent for memory metrics. CPU alone is insufficient. An instance running at 15% CPU but 85% memory is correctly sized. Without memory data, Compute Optimizer will recommend downsizing it, and performance will collapse.
- Start with the largest instances. Sort by monthly spend and right-size from the top. The first five instances often account for more savings than the next fifty.
- Test in staging first. Resize the staging instance, run load tests, observe for a week, then resize production.
- Automate ongoing monitoring. Right-sizing is continuous. Workload patterns change. Set up monthly Compute Optimizer reports delivered to your FinOps team.
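Step 3 above is worth making concrete: spend is usually concentrated, so sorting by monthly cost finds most of the savings in a handful of instances. A sketch with invented data:

```python
# Sketch of "start with the largest instances": rank by monthly spend and
# right-size from the top. Instance data is made up for illustration.
fleet = [
    {"id": "i-01", "type": "m7i.4xlarge", "monthly_usd": 589},
    {"id": "i-02", "type": "m7i.xlarge",  "monthly_usd": 147},
    {"id": "i-03", "type": "r7i.2xlarge", "monthly_usd": 386},
    {"id": "i-04", "type": "t3.medium",   "monthly_usd": 30},
]

by_spend = sorted(fleet, key=lambda i: i["monthly_usd"], reverse=True)
top = by_spend[:2]
share = sum(i["monthly_usd"] for i in top) / sum(i["monthly_usd"] for i in fleet)
print(f"Top 2 instances carry {share:.0%} of fleet spend")
```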
Graviton Migration: The Easiest 20% You Will Ever Save
AWS Graviton processors (ARM64 architecture) deliver 20% lower price per hour than equivalent Intel/AMD instances, with equal or better performance for most workloads. Graviton4 (the latest generation, available on R8g and M8g instances) provides up to 40% better price-performance than comparable x86 instances.
The Price Difference
For the same vCPU count and memory, Graviton instances consistently cost 20% less:
| Instance Pair | vCPUs | Memory | Intel/AMD Hourly | Graviton Hourly | Savings |
|---|---|---|---|---|---|
| m7i.xlarge vs. m7g.xlarge | 4 | 16 GB | ~$0.2016 | ~$0.1632 | 19% |
| c7i.xlarge vs. c7g.xlarge | 4 | 8 GB | ~$0.1785 | ~$0.1452 | 19% |
| r7i.xlarge vs. r7g.xlarge | 4 | 32 GB | ~$0.2646 | ~$0.2138 | 19% |
| m7i.4xlarge vs. m7g.4xlarge | 16 | 64 GB | ~$0.8064 | ~$0.6528 | 19% |
The 20% price reduction is the floor. Graviton instances also deliver 20-25% better performance per vCPU for most workloads, which means you can often drop an instance size after migration. A workload running on m7i.2xlarge may run equally well on m7g.xlarge, combining the 20% price reduction with a 50% size reduction for a total savings near 60%.
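The compounding claim checks out against the article's illustrative rates:

```python
# Price cut plus a one-size drop: m7i.2xlarge down to m7g.xlarge.
m7i_2xlarge = 0.4032  # $/hr, illustrative us-east-1 rate
m7g_xlarge  = 0.1632  # $/hr, one size smaller on Graviton

savings = 1 - m7g_xlarge / m7i_2xlarge
print(f"m7i.2xlarge -> m7g.xlarge: {savings:.0%} total savings")
```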
What Migrates Easily
Most modern software stacks run on Graviton without modification:
- Containerized workloads (Docker multi-arch builds): rebuild with `--platform linux/arm64` and deploy. If your images already support multi-arch, change the instance type and you are done.
- Interpreted and JVM languages (Python, Node.js, Ruby, Java): the runtime handles the architecture. Change the instance type. Performance improves.
- Compiled Go and Rust: cross-compile with `GOARCH=arm64` or `--target aarch64-unknown-linux-gnu`. Typically a one-line CI change.
- Managed services: RDS, ElastiCache, OpenSearch, and EKS all offer Graviton instance types. Switch the instance class in the configuration.
What Requires Work
- Native C/C++ extensions compiled for x86: need recompilation. Most open-source libraries have ARM packages in modern distros.
- Legacy commercial software with x86-only binaries: no migration path without vendor support.
- Workloads using x86-specific intrinsics (SSE, AVX): rare outside of specialized numerical computing.
Migration Strategy
I migrate in this order: non-production first, then stateless services, then stateful services, then databases.
```mermaid
flowchart TD
    A[Identify Candidates] --> B[Dev/Test Environments]
    B --> C[CI/CD Build Agents]
    C --> D[Stateless Web and API Servers]
    D --> E[Worker Fleets and Batch Jobs]
    E --> F["Caches: ElastiCache/Redis"]
    F --> G["Databases: RDS/Aurora"]
```
For each service, the process is: deploy to a Graviton instance in staging, run the full test suite, observe for a week under production-like load, then switch production. I have migrated fleets of 200+ instances to Graviton across three organizations. The failure rate (workloads that could not run on ARM) was under 5%, and every failure was a native binary dependency with no ARM build available at the time.
Scheduling: Stop Paying for Idle Dev/Test
Development, test, staging, QA, and demo environments typically run 24/7 despite being used only during business hours. A 50-hour work week is 30% of the 168 hours in a calendar week. Scheduling these environments to stop outside business hours saves 70%.
AWS Instance Scheduler
The AWS Instance Scheduler solution deploys via CloudFormation and uses resource tags to control EC2 and RDS instance start/stop schedules. Define a schedule (weekdays 8 AM to 6 PM EST), tag your instances (Schedule=business-hours), and the scheduler stops them every evening and starts them every morning.
The Math
| Scenario | Hours/Week | Monthly Cost (10x m7g.xlarge) | Annual Cost | Savings vs. Always-On |
|---|---|---|---|---|
| Always-on (24/7) | 168 | ~$1,193 | ~$14,312 | Baseline |
| Business hours (10x5) | 50 | ~$355 | ~$4,260 | 70% |
| Extended hours (12x5) | 60 | ~$426 | ~$5,112 | 64% |
| Business + Saturday (10x6) | 60 | ~$426 | ~$5,112 | 64% |
For a team running three environments (dev, test, staging) with 10 instances each at m7g.xlarge, switching from always-on to business-hours scheduling saves ~$30,000/year. That is one configuration change and a set of tags.
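The table's numbers follow from scaling hourly cost by weekly hours. A sketch, assuming 52/12 weeks per month (slight rounding differences from the table are expected):

```python
# Sketch: monthly cost of a scheduled environment scales with weekly hours.
WEEKS_PER_MONTH = 52 / 12
RATE = 0.1632        # m7g.xlarge $/hr, illustrative
INSTANCES = 10

def monthly_cost(hours_per_week: float) -> float:
    return RATE * hours_per_week * WEEKS_PER_MONTH * INSTANCES

always_on = monthly_cost(168)
business  = monthly_cost(50)
print(f"always-on ~${always_on:,.0f}/mo, business-hours ~${business:,.0f}/mo, "
      f"savings {1 - business / always_on:.0%}")
```

Because cost is linear in hours, the savings percentage is just 1 - 50/168, independent of instance type or fleet size.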
Operational Considerations
The biggest pushback I hear: "But our CI pipeline runs overnight." Fair point. Solutions:
- Trigger-based start: Use EventBridge to start instances when a CI pipeline begins (CodePipeline state change event) and stop them when the pipeline completes.
- Separate CI from dev. CI/CD build agents run on Spot Instances (cheap, ephemeral). Dev environments run on scheduled instances. Different lifecycles, different cost profiles.
- Staggered schedules for global teams. Tag instances with timezone-aware schedules. The US dev environment runs 8 AM to 6 PM EST. The India staging environment runs 9 AM to 7 PM IST.
Stacking Strategies: The Optimization Order
The five strategies interact. Applying them in the right order maximizes the compound effect. Here is the full sequence with cumulative impact on a hypothetical 50-instance production fleet running m7i.xlarge at $0.2016/hour On-Demand:
| Step | Action | Per-Instance Rate | Fleet Monthly Cost | Cumulative Savings |
|---|---|---|---|---|
| 0 | Baseline (On-Demand) | $0.2016/hr | $7,258 | 0% |
| 1 | Right-size (50% over-provisioned → m7i.large) | $0.1008/hr | $3,629 | 50% |
| 2 | Migrate to Graviton (m7i.large → m7g.large) | $0.0816/hr | $2,938 | 60% |
| 3 | Commit (1-year Compute SP, ~35% off) | $0.0530/hr | $1,908 | 74% |
| 4 | Move 20% of fleet to Spot (~70% off for those) | Blended ~$0.0470/hr | $1,692 | 77% |
Starting monthly bill: $7,258. Final monthly bill: $1,692. That is a 77% reduction, and the workload runs on better hardware (Graviton outperforms Intel per-vCPU) with no degradation in capacity.
The sequence matters because each step reduces the baseline for the next. Right-sizing before committing means the Savings Plan covers the correct instance size. Migrating to Graviton before committing means the Savings Plan rate reflects the lower Graviton price. Scheduling non-production before committing means you do not overcommit for instances that only run 50 hours/week.
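The stacking table reduces to a short calculation. Rates are the article's illustrative figures; the blended Step 4 rate here comes out slightly above the table's rounded $0.0470.

```python
# Sketch of the stacking table: each step sets a new per-instance rate;
# Step 4 blends 80% committed capacity with 20% Spot.
HOURS, FLEET = 720, 50

on_demand  = 0.2016                  # Step 0: m7i.xlarge On-Demand
right_size = 0.1008                  # Step 1: m7i.large
graviton   = 0.0816                  # Step 2: m7g.large
committed  = graviton * 0.65         # Step 3: ~35% commitment discount
blended    = 0.8 * committed + 0.2 * graviton * 0.30  # Step 4: Spot at ~70% off

print(f"blended ~${blended:.4f}/hr, fleet ~${blended * HOURS * FLEET:,.0f}/mo, "
      f"cumulative savings {1 - blended / on_demand:.0%}")
```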
Key Takeaways
- Right-size before everything else. Committing to oversized instances locks in waste for one to three years. Use Compute Optimizer with the CloudWatch Agent installed for memory metrics. Start with the largest instances by spend.
- Graviton is the lowest-effort optimization. A 20% price reduction with equal or better performance. Most modern workloads migrate without code changes. Start with non-production and work up.
- Schedule every non-production environment. Stopping dev/test/staging during non-business hours saves 65-70%. The AWS Instance Scheduler solution handles this with tags and a CloudFormation deployment.
- Use Compute Savings Plans as your default commitment. The flexibility to change instance families, regions, and compute services (EC2 to Fargate) outweighs the 6% deeper discount from EC2 Instance Plans. Cover 70-80% of steady-state usage. Prefer 1-year terms unless you have exceptional forecast confidence.
- Layer Spot Instances on fault-tolerant workloads. CI/CD, batch processing, and containerized stateless services benefit from 60-90% discounts. Use `price-capacity-optimized` allocation, diversify across instance types, and handle the two-minute interruption signal.
- Apply the strategies in sequence. Right-size, Graviton, schedule, commit, Spot. Each step lowers the baseline for the next. The compound effect routinely exceeds 60% total reduction.
- Re-evaluate quarterly. Fleet composition, workload patterns, and AWS pricing all change. A Savings Plan that covered 80% of usage in January may cover 55% by September. Compute Optimizer recommendations shift as workloads evolve.
Additional Resources
- AWS EC2 Pricing
- AWS Savings Plans User Guide
- EC2 Spot Instance Best Practices
- AWS Compute Optimizer
- AWS Graviton Processor
- Instance Scheduler on AWS
- Diving Deep into EC2 Spot Instance Cost and Operational Practices
- Building Cost-Effective AWS Step Functions Workflows
- EC2 Instance Comparison Tool (Vantage)
Let's Build Something!
I help teams ship cloud infrastructure that actually works at scale. Whether you're modernizing a legacy platform, designing a multi-region architecture from scratch, or figuring out how AI fits into your engineering workflow, I've seen your problem before. Let me help.
Currently taking on select consulting engagements through Vantalect.

