About the author: I'm Charles Sieg, a cloud architect and platform engineer who builds apps, services, and infrastructure for Fortune 1000 clients through Vantalect. If your organization is rethinking its software strategy in the age of AI-assisted engineering, let's talk.
EC2 is the single largest line item on most AWS bills. It is also the line item where the gap between what teams pay and what they should pay is the widest. I have audited AWS accounts where compute spending dropped 45% in the first month after applying the strategies in this article. No performance loss. No architectural changes. Just purchasing mechanics, instance selection, and scheduling discipline. The savings were always available. The team had never looked.
This is the second article in the AWS Cost Savings series (see AWS S3 Cost Optimization: The Complete Savings Playbook for S3). What follows covers five specific levers for reducing EC2 spend: commitment discounts, Spot Instances, right-sizing, Graviton migration, and scheduling. Each section includes the pricing math, break-even calculations, and the operational gotchas that catch teams who optimize without understanding the trade-offs.
The Five Levers of EC2 Cost Optimization
Before diving into each strategy, here is the landscape. Every EC2 cost reduction falls into one of five categories, and the order matters. Applying them in the wrong sequence wastes effort and leaves savings on the table.
The Optimization Sequence
Right-size first, then migrate to Graviton, then schedule non-production, then commit, then layer in Spot. Each step changes the baseline that the next step operates on. Committing to oversized instances locks in waste for one to three years. Scheduling a fleet before right-sizing it means you schedule instances that are too large. The sequence protects against compounding bad decisions.
```mermaid
flowchart LR
    A[Right-Size Instances] --> B[Migrate to Graviton]
    B --> C[Schedule Non-Production]
    C --> D[Commit via Savings Plans]
    D --> E[Layer in Spot Instances]
```
Savings by Strategy
| Strategy | Typical Savings | Risk Level | Effort | Applies To |
|---|---|---|---|---|
| Right-sizing | 20-40% | Low | Medium | All instances |
| Graviton migration | 20-40% | Low | Medium | Workloads with ARM support |
| Scheduling | 65-70% | None | Low | Dev, test, staging environments |
| Savings Plans | 30-72% | Medium (commitment) | Low | Steady-state production |
| Spot Instances | Up to 90% | High (interruption) | High | Stateless, fault-tolerant workloads |
The percentages compound. Right-sizing a fleet from m7i.2xlarge to m7i.xlarge saves ~50%. Migrating that right-sized fleet to m7g.xlarge saves another ~20%. Scheduling non-production instances to run 50 hours/week instead of 168 saves another ~70% on those instances. Committing to the remaining steady-state production saves 30-66%. Applied in sequence across a typical mixed fleet, the total reduction from the original On-Demand bill often exceeds 60%.
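The compounding can be sketched in a few lines. Rates are the article's illustrative us-east-1 Linux On-Demand figures; each lever multiplies the previous per-hour rate rather than adding percentages.

```python
# Sketch: how the levers compound. Rates are illustrative; verify current
# AWS pricing before relying on them.
ON_DEMAND = 0.4032  # m7i.2xlarge, $/hr
steps = [
    ("right-size to m7i.xlarge", 0.2016 / 0.4032),  # keep ~50% of the rate
    ("migrate to m7g.xlarge",    0.1632 / 0.2016),  # keep ~81%
    ("1-year Compute SP",        0.65),             # keep ~65% (~35% discount)
]

rate = ON_DEMAND
for name, retained in steps:
    rate *= retained
    print(f"{name}: ${rate:.4f}/hr, cumulative savings {1 - rate / ON_DEMAND:.0%}")
```

Three multiplicative steps of 50%, 19%, and 35% land near 74% total, which is why sequencing beats any single lever.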
Commitment Discounts: Reserved Instances vs. Savings Plans
AWS offers two commitment mechanisms: Reserved Instances (RIs) and Savings Plans. Both trade flexibility for discount, and the right choice depends on how predictable your fleet composition is.
Reserved Instances
RIs commit you to a specific instance type, operating system, tenancy, and region for one or three years. In return, AWS discounts the hourly rate by up to 72% for Standard RIs and up to 66% for Convertible RIs. Three payment options exist:
| Payment Option | Discount Depth | Cash Flow Impact |
|---|---|---|
| All Upfront | Highest (~72% for 3-year Standard) | Large upfront payment, no monthly charges |
| Partial Upfront | Medium (~65% for 3-year Standard) | Reduced upfront, reduced monthly |
| No Upfront | Lowest (~40% for 1-year Standard) | No upfront, discounted monthly rate |
Standard RIs lock you into the exact instance type. If you buy m5.xlarge RIs and later right-size to m5.large, the RI still applies (within the same family, size flexibility normalizes the discount). But if you switch families (m5 to m7i), the Standard RI becomes wasted spend. Convertible RIs let you exchange for different instance types, but the discount is shallower.
Savings Plans
Savings Plans commit you to a dollar-per-hour spend rate rather than a specific instance. Three types:
| Plan Type | Flexibility | Max Discount | Scope |
|---|---|---|---|
| Compute Savings Plan | Highest | Up to 66% | EC2, Fargate, Lambda; any region, family, size, OS |
| EC2 Instance Savings Plan | Medium | Up to 72% | EC2 only; specific family and region, any size and OS |
| SageMaker Savings Plan | ML-specific | Up to 64% | SageMaker instance usage only |
Compute Savings Plans are the default recommendation for most teams. The discount is slightly shallower than EC2 Instance Plans (66% vs. 72% at the 3-year All Upfront tier), but the flexibility to change instance families, regions, and even compute services (moving workloads from EC2 to Fargate) without losing the discount makes that trade worthwhile.
The Commitment Decision Framework
My standard approach for commitment purchases:
- Cover 70-80% of steady-state with Compute Savings Plans. This accounts for organic growth and workload shifts without overcommitting.
- Use 1-year terms unless you have high confidence in a 3-year forecast. The incremental savings from 3-year terms (roughly 15-20% more than 1-year) rarely justify the inflexibility for teams whose architectures evolve.
- Layer EC2 Instance Plans on top for workloads that genuinely will not change family or region for the term (database hosts, fixed infrastructure).
- Re-evaluate quarterly. Usage patterns drift. A commitment that covered 80% of your fleet in January may cover 60% by June if you have launched new services or migrated workloads.
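The first rule above can be reduced to arithmetic. A toy sketch, assuming you have hourly compute-spend samples (from Cost Explorer or CUR exports; the numbers here are invented): take the floor of observed spend as steady-state and commit to a fraction of it.

```python
# Hedged sketch: size a Savings Plan commitment at 75% of the steady-state
# floor. Spend samples are made up for illustration.
hourly_spend = [4.2, 4.0, 4.1, 6.5, 7.0, 4.3, 4.0, 5.8]  # $/hr

baseline = min(hourly_spend)             # spend you incur every hour, bursts excluded
commitment = round(baseline * 0.75, 2)   # cover 75% of the floor
print(f"commit to ${commitment}/hr")
```

Using the minimum (or a low percentile) rather than the mean keeps bursty hours from inflating the commitment.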
Cost Math Example
A fleet of 20 m7i.xlarge instances running 24/7 in us-east-1:
| Pricing Model | Hourly Rate | Monthly Cost (20 instances) | Annual Cost | Savings vs. On-Demand |
|---|---|---|---|---|
| On-Demand | ~$0.2016 | ~$2,903 | ~$34,841 | Baseline |
| 1-year Compute SP (No Upfront) | ~$0.1310 | ~$1,886 | ~$22,637 | 35% |
| 1-year EC2 Instance SP (All Upfront) | ~$0.1150 | ~$1,656 | ~$19,872 | 43% |
| 3-year Compute SP (All Upfront) | ~$0.0820 | ~$1,181 | ~$14,170 | 59% |
| 3-year EC2 Instance SP (All Upfront) | ~$0.0580 | ~$835 | ~$10,022 | 71% |
The spread between 1-year and 3-year is roughly $10,000/year for this 20-instance fleet. That is real money, but only if you are confident those 20 instances will still be m7i.xlarge in the same region three years from now.
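The table's figures can be reproduced from the hourly rates, assuming a 720-hour month; the small differences from the table are rounding.

```python
# Sketch reproducing the fleet cost table (illustrative rates from the article).
HOURS_PER_MONTH = 720  # assumption: 30-day month
FLEET = 20

on_demand = 0.2016
plans = {
    "1yr Compute SP (No Upfront)":       0.1310,
    "1yr EC2 Instance SP (All Upfront)": 0.1150,
    "3yr Compute SP (All Upfront)":      0.0820,
    "3yr EC2 Instance SP (All Upfront)": 0.0580,
}

for name, rate in plans.items():
    monthly = rate * HOURS_PER_MONTH * FLEET
    savings = 1 - rate / on_demand
    print(f"{name}: ${monthly:,.0f}/mo, {savings:.0%} vs On-Demand")
```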
Spot Instances: 90% Savings with Production Guardrails
Spot Instances use spare EC2 capacity at discounts up to 90% off On-Demand. AWS can reclaim them with two minutes of notice. That constraint makes them unsuitable for stateful workloads but exceptional for anything that can tolerate interruption.
Where Spot Works
| Workload Type | Spot Fit | Why |
|---|---|---|
| CI/CD build agents | Excellent | Short-lived, stateless, restartable |
| Batch processing (ETL, data pipelines) | Excellent | Checkpointable, parallelizable |
| Container tasks (ECS, EKS) | Good | Orchestrator handles rescheduling |
| Web/API servers behind ALB | Good with caveats | Need mixed instance groups and connection draining |
| Databases | Never | Stateful, interruption causes data loss |
| Single long-running jobs | Poor | Interruption restarts the entire job |
Interruption Rates and Allocation Strategy
AWS publishes interruption frequency data through the Spot Instance Advisor. Less than 5% of Spot Instances get reclaimed by AWS before the customer terminates them voluntarily. The trick is diversification. Requesting a single instance type in a single AZ concentrates your interruption risk. Spreading across multiple instance types, sizes, and AZs through a Spot Fleet or EC2 Auto Scaling group with mixed instance policies drops the effective interruption rate dramatically.
Use the price-capacity-optimized allocation strategy. This selects instances from the pools with the most available capacity at the lowest price. Older strategies (lowest-price, capacity-optimized) optimize for one dimension. The combined strategy handles both.
Interruption Handling
The two-minute warning arrives as an EC2 metadata event and an EventBridge notification. Use both:
- Metadata polling: A daemon on the instance checks `http://169.254.169.254/latest/meta-data/spot/instance-action` every 5 seconds. When the response changes from 404 to a JSON payload, the instance has two minutes.
- EventBridge rule: Triggers a Lambda that deregisters the instance from the target group, drains connections, and pushes state to S3 or DynamoDB.
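The polling side is small enough to sketch. A real daemon would GET the metadata URL (with an IMDSv2 token) on a timer; the part worth showing is interpreting the two possible responses, which is all this hedged sketch does.

```python
import json
from datetime import datetime, timezone

def parse_instance_action(status_code: int, body: str):
    """Return (action, deadline) if an interruption is pending, else None.

    404 means no interruption is scheduled. Otherwise the body is a JSON
    payload like {"action": "terminate", "time": "2025-01-01T12:00:00Z"}.
    """
    if status_code == 404:
        return None
    payload = json.loads(body)
    deadline = datetime.strptime(
        payload["time"], "%Y-%m-%dT%H:%M:%SZ"
    ).replace(tzinfo=timezone.utc)
    return payload["action"], deadline

# Normal case: nothing pending.
print(parse_instance_action(404, ""))
```

When a payload appears, the daemon has roughly two minutes until `deadline` to drain connections and checkpoint state.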
I have run Spot fleets for CI/CD pipelines that executed 50,000+ builds per month. Total interruption-caused failures over six months: 12. Every one recovered automatically because the build system (Jenkins, in that case) retried on a fresh instance. The cost was 73% below On-Demand. For batch workloads that checkpoint to S3 every 10 minutes, interruption is a non-event: the job resumes from the last checkpoint on a new instance.
Spot and Auto Scaling
Configure Auto Scaling groups with a mixed instances policy. Set a base capacity of On-Demand instances (enough to handle minimum traffic) and fill the rest with Spot. The capacity-rebalancing feature automatically launches replacement Spot Instances when the rebalance recommendation signal fires, before the actual interruption happens. This keeps your fleet stable through capacity shifts.
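A minimal sketch of what that policy looks like, in the shape boto3's `autoscaling.create_auto_scaling_group` accepts (verify field names against current boto3 docs; the launch template name is a placeholder):

```python
# Hedged sketch: MixedInstancesPolicy with an On-Demand base and Spot fill.
# "web-fleet" is a hypothetical launch template name.
mixed_instances_policy = {
    "LaunchTemplate": {
        "LaunchTemplateSpecification": {
            "LaunchTemplateName": "web-fleet",
            "Version": "$Latest",
        },
        # Diversify across types and sizes to lower effective interruption rate.
        "Overrides": [
            {"InstanceType": t}
            for t in ["m7g.large", "m7g.xlarge", "m6g.large",
                      "m6g.xlarge", "c7g.xlarge"]
        ],
    },
    "InstancesDistribution": {
        "OnDemandBaseCapacity": 2,                 # always-on floor
        "OnDemandPercentageAboveBaseCapacity": 0,  # fill the rest with Spot
        "SpotAllocationStrategy": "price-capacity-optimized",
    },
}
print(len(mixed_instances_policy["LaunchTemplate"]["Overrides"]), "Spot pools per AZ")
```

Five override types across each AZ gives the allocation strategy many pools to choose from, which is what drops the effective interruption rate.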
Right-Sizing with Compute Optimizer
Right-sizing is the most impactful optimization most teams skip. Over-provisioned instances are everywhere. Teams launch an m5.2xlarge "to be safe" during development and never revisit it. That instance runs at 8% average CPU for two years.
How Compute Optimizer Works
AWS Compute Optimizer analyzes 14 days of CloudWatch metrics (CPU utilization, memory utilization via CloudWatch Agent, network I/O, disk I/O) and cross-references them against the performance characteristics of 140+ instance types. It produces three recommendation tiers:
| Recommendation | Meaning | Typical Action |
|---|---|---|
| Over-provisioned | Instance resources exceed workload needs | Downsize to recommended type |
| Under-provisioned | Workload exceeds instance resources | Upsize or change family |
| Optimized | Instance matches workload | No action needed |
In my experience, 40-60% of instances in a typical AWS account are over-provisioned, and the median right-sizing recommendation saves ~35% per instance. For a 100-instance fleet averaging $200/month per instance, resizing the over-provisioned half at that median saving translates to roughly $42,000/year from right-sizing alone.
The Right-Sizing Process
- Enable Compute Optimizer across all accounts in your AWS Organization. Free for basic recommendations. Enhanced recommendations (3 months of metric history, memory metrics) cost $0.0003360 per resource per hour.
- Install the CloudWatch Agent for memory metrics. CPU alone is insufficient. An instance running at 15% CPU but 85% memory is correctly sized. Without memory data, Compute Optimizer will recommend downsizing it, and performance will collapse.
- Start with the largest instances. Sort by monthly spend and right-size from the top. The first five instances often account for more savings than the next fifty.
- Test in staging first. Resize the staging instance, run load tests, observe for a week, then resize production.
- Automate ongoing monitoring. Right-sizing is continuous. Workload patterns change. Set up monthly Compute Optimizer reports delivered to your FinOps team.
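Step 3 above is worth making concrete: spend is usually concentrated, so sorting by monthly cost finds most of the savings in a handful of instances. A sketch with invented data:

```python
# Sketch of "start with the largest instances": rank by monthly spend and
# right-size from the top. Instance data is made up for illustration.
fleet = [
    {"id": "i-01", "type": "m7i.4xlarge", "monthly_usd": 589},
    {"id": "i-02", "type": "m7i.xlarge",  "monthly_usd": 147},
    {"id": "i-03", "type": "r7i.2xlarge", "monthly_usd": 386},
    {"id": "i-04", "type": "t3.medium",   "monthly_usd": 30},
]

by_spend = sorted(fleet, key=lambda i: i["monthly_usd"], reverse=True)
top = by_spend[:2]
share = sum(i["monthly_usd"] for i in top) / sum(i["monthly_usd"] for i in fleet)
print(f"Top 2 instances carry {share:.0%} of fleet spend")
```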
Graviton Migration: The Easiest 20% You Will Ever Save
AWS Graviton processors (ARM64 architecture) deliver 20% lower price per hour than equivalent Intel/AMD instances, with equal or better performance for most workloads. Graviton4 (the latest generation, available on R8g and M8g instances) provides up to 40% better price-performance than comparable x86 instances.
The Price Difference
For the same vCPU count and memory, Graviton instances consistently cost 20% less:
| Instance Pair | vCPUs | Memory | Intel/AMD Hourly | Graviton Hourly | Savings |
|---|---|---|---|---|---|
| m7i.xlarge vs. m7g.xlarge | 4 | 16 GB | ~$0.2016 | ~$0.1632 | 19% |
| c7i.xlarge vs. c7g.xlarge | 4 | 8 GB | ~$0.1785 | ~$0.1452 | 19% |
| r7i.xlarge vs. r7g.xlarge | 4 | 32 GB | ~$0.2646 | ~$0.2138 | 19% |
| m7i.4xlarge vs. m7g.4xlarge | 16 | 64 GB | ~$0.8064 | ~$0.6528 | 19% |
The 20% price reduction is the floor. Graviton instances also deliver 20-25% better performance per vCPU for most workloads, which means you can often drop an instance size after migration. A workload running on m7i.2xlarge may run equally well on m7g.xlarge, combining the 20% price reduction with a 50% size reduction for a total savings near 60%.
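The compounding claim checks out against the article's illustrative rates:

```python
# Price cut plus a one-size drop: m7i.2xlarge down to m7g.xlarge.
m7i_2xlarge = 0.4032  # $/hr, illustrative us-east-1 rate
m7g_xlarge  = 0.1632  # $/hr, one size smaller on Graviton

savings = 1 - m7g_xlarge / m7i_2xlarge
print(f"m7i.2xlarge -> m7g.xlarge: {savings:.0%} total savings")
```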
What Migrates Easily
Most modern software stacks run on Graviton without modification:
- Containerized workloads (Docker multi-arch builds): rebuild with `--platform linux/arm64` and deploy. If your images already support multi-arch, change the instance type and you are done.
- Interpreted and JVM languages (Python, Node.js, Ruby, Java): the runtime handles the architecture. Change the instance type. Performance improves.
- Compiled Go and Rust: cross-compile with `GOARCH=arm64` or `--target aarch64-unknown-linux-gnu`. Typically a one-line CI change.
- Managed services: RDS, ElastiCache, OpenSearch, and EKS all offer Graviton instance types. Switch the instance class in the configuration.
What Requires Work
- Native C/C++ extensions compiled for x86: need recompilation. Most open-source libraries have ARM packages in modern distros.
- Legacy commercial software with x86-only binaries: no migration path without vendor support.
- Workloads using x86-specific intrinsics (SSE, AVX): rare outside of specialized numerical computing.
Migration Strategy
I migrate in this order: non-production first, then stateless services, then stateful services, then databases.
```mermaid
flowchart TD
    A[Identify Candidates] --> B[Dev/Test Environments]
    B --> C[CI/CD Build Agents]
    C --> D[Stateless Web and API Servers]
    D --> E[Worker Fleets and Batch Jobs]
    E --> F["Caches: ElastiCache/Redis"]
    F --> G["Databases: RDS/Aurora"]
```
For each service, the process is: deploy to a Graviton instance in staging, run the full test suite, observe for a week under production-like load, then switch production. I have migrated fleets of 200+ instances to Graviton across three organizations. The failure rate (workloads that could not run on ARM) was under 5%, and every failure was a native binary dependency with no ARM build available at the time.
Scheduling: Stop Paying for Idle Dev/Test
Development, test, staging, QA, and demo environments typically run 24/7 despite being used only during business hours. A 50-hour work week is 30% of the 168 hours in a calendar week. Scheduling these environments to stop outside business hours saves 70%.
AWS Instance Scheduler
The AWS Instance Scheduler solution deploys via CloudFormation and uses resource tags to control EC2 and RDS instance start/stop schedules. Define a schedule (weekdays 8 AM to 6 PM EST), tag your instances (Schedule=business-hours), and the scheduler stops them every evening and starts them every morning.
The Math
| Scenario | Hours/Week | Monthly Cost (10x m7g.xlarge) | Annual Cost | Savings vs. Always-On |
|---|---|---|---|---|
| Always-on (24/7) | 168 | ~$1,193 | ~$14,312 | Baseline |
| Business hours (10x5) | 50 | ~$355 | ~$4,260 | 70% |
| Extended hours (12x5) | 60 | ~$426 | ~$5,112 | 64% |
| Business + Saturday (10x6) | 60 | ~$426 | ~$5,112 | 64% |
For a team running three environments (dev, test, staging) with 10 instances each at m7g.xlarge, switching from always-on to business-hours scheduling saves ~$30,000/year. That is one configuration change and a set of tags.
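The table's numbers follow from scaling hourly cost by weekly hours. A sketch, assuming 52/12 weeks per month (slight rounding differences from the table are expected):

```python
# Sketch: monthly cost of a scheduled environment scales with weekly hours.
WEEKS_PER_MONTH = 52 / 12
RATE = 0.1632        # m7g.xlarge $/hr, illustrative
INSTANCES = 10

def monthly_cost(hours_per_week: float) -> float:
    return RATE * hours_per_week * WEEKS_PER_MONTH * INSTANCES

always_on = monthly_cost(168)
business  = monthly_cost(50)
print(f"always-on ~${always_on:,.0f}/mo, business-hours ~${business:,.0f}/mo, "
      f"savings {1 - business / always_on:.0%}")
```

Because cost is linear in hours, the savings percentage is just 1 - 50/168, independent of instance type or fleet size.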
Operational Considerations
The biggest pushback I hear: "But our CI pipeline runs overnight." Fair point. Solutions:
- Trigger-based start: Use EventBridge to start instances when a CI pipeline begins (CodePipeline state change event) and stop them when the pipeline completes.
- Separate CI from dev. CI/CD build agents run on Spot Instances (cheap, ephemeral). Dev environments run on scheduled instances. Different lifecycles, different cost profiles.
- Staggered schedules for global teams. Tag instances with timezone-aware schedules. The US dev environment runs 8 AM to 6 PM EST. The India staging environment runs 9 AM to 7 PM IST.
Stacking Strategies: The Optimization Order
The five strategies interact. Applying them in the right order maximizes the compound effect. Here is the full sequence with cumulative impact on a hypothetical 50-instance production fleet running m7i.xlarge at $0.2016/hour On-Demand:
| Step | Action | Per-Instance Rate | Fleet Monthly Cost | Cumulative Savings |
|---|---|---|---|---|
| 0 | Baseline (On-Demand) | $0.2016/hr | $7,258 | 0% |
| 1 | Right-size (50% over-provisioned → m7i.large) | $0.1008/hr | $3,629 | 50% |
| 2 | Migrate to Graviton (m7i.large → m7g.large) | $0.0816/hr | $2,938 | 60% |
| 3 | Commit (1-year Compute SP, ~35% off) | $0.0530/hr | $1,908 | 74% |
| 4 | Move 20% of fleet to Spot (~70% off for those) | Blended ~$0.0470/hr | $1,692 | 77% |
Starting monthly bill: $7,258. Final monthly bill: $1,692. That is a 77% reduction, and the workload runs on better hardware (Graviton outperforms Intel per-vCPU) with no degradation in capacity.
The sequence matters because each step reduces the baseline for the next. Right-sizing before committing means the Savings Plan covers the correct instance size. Migrating to Graviton before committing means the Savings Plan rate reflects the lower Graviton price. Scheduling non-production before committing means you do not overcommit for instances that only run 50 hours/week.
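The stacking table reduces to a short calculation. Rates are the article's illustrative figures; the blended Step 4 rate here comes out slightly above the table's rounded $0.0470.

```python
# Sketch of the stacking table: each step sets a new per-instance rate;
# Step 4 blends 80% committed capacity with 20% Spot.
HOURS, FLEET = 720, 50

on_demand  = 0.2016                  # Step 0: m7i.xlarge On-Demand
right_size = 0.1008                  # Step 1: m7i.large
graviton   = 0.0816                  # Step 2: m7g.large
committed  = graviton * 0.65         # Step 3: ~35% commitment discount
blended    = 0.8 * committed + 0.2 * graviton * 0.30  # Step 4: Spot at ~70% off

print(f"blended ~${blended:.4f}/hr, fleet ~${blended * HOURS * FLEET:,.0f}/mo, "
      f"cumulative savings {1 - blended / on_demand:.0%}")
```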
Key Takeaways
- Right-size before everything else. Committing to oversized instances locks in waste for one to three years. Use Compute Optimizer with the CloudWatch Agent installed for memory metrics. Start with the largest instances by spend.
- Graviton is the lowest-effort optimization. A 20% price reduction with equal or better performance. Most modern workloads migrate without code changes. Start with non-production and work up.
- Schedule every non-production environment. Stopping dev/test/staging during non-business hours saves 65-70%. The AWS Instance Scheduler solution handles this with tags and a CloudFormation deployment.
- Use Compute Savings Plans as your default commitment. The flexibility to change instance families, regions, and compute services (EC2 to Fargate) outweighs the 6% deeper discount from EC2 Instance Plans. Cover 70-80% of steady-state usage. Prefer 1-year terms unless you have exceptional forecast confidence.
- Layer Spot Instances on fault-tolerant workloads. CI/CD, batch processing, and containerized stateless services benefit from 60-90% discounts. Use `price-capacity-optimized` allocation, diversify across instance types, and handle the two-minute interruption signal.
- Apply the strategies in sequence. Right-size, Graviton, schedule, commit, Spot. Each step lowers the baseline for the next. The compound effect routinely exceeds 60% total reduction.
- Re-evaluate quarterly. Fleet composition, workload patterns, and AWS pricing all change. A Savings Plan that covered 80% of usage in January may cover 55% by September. Compute Optimizer recommendations shift as workloads evolve.
Additional Resources
- AWS EC2 Pricing
- AWS Savings Plans User Guide
- EC2 Spot Instance Best Practices
- AWS Compute Optimizer
- AWS Graviton Processor
- Instance Scheduler on AWS
- Diving Deep into EC2 Spot Instance Cost and Operational Practices
- Building Cost-Effective AWS Step Functions Workflows
- EC2 Instance Comparison Tool (Vantage)
Let's Build Something!
I help teams ship cloud infrastructure that actually works at scale. Whether you're modernizing a legacy platform, designing a multi-region architecture from scratch, or figuring out how AI fits into your engineering workflow, I've seen your problem before. Let me help.
Currently taking on select consulting engagements through Vantalect.

