
AWS EC2 Cost Optimization: Five Strategies That Cut Compute Bills in Half

Tags: AWS · Cost Optimization · EC2 · Compute

About the author: I'm Charles Sieg, a cloud architect and platform engineer who builds apps, services, and infrastructure for Fortune 1000 clients through Vantalect. If your organization is rethinking its software strategy in the age of AI-assisted engineering, let's talk.

EC2 is the single largest line item on most AWS bills. It is also the line item where the gap between what teams pay and what they should pay is the widest. I have audited AWS accounts where compute spending dropped 45% in the first month after applying the strategies in this article. No performance loss. No architectural changes. Just purchasing mechanics, instance selection, and scheduling discipline. The savings were always available. The team had never looked.

This is the second article in the AWS Cost Savings series (see AWS S3 Cost Optimization: The Complete Savings Playbook for S3). What follows covers five specific levers for reducing EC2 spend: commitment discounts, Spot Instances, right-sizing, Graviton migration, and scheduling. Each section includes the pricing math, break-even calculations, and the operational gotchas that catch teams who optimize without understanding the trade-offs.

The Five Levers of EC2 Cost Optimization

Before diving into each strategy, here is the landscape. Every EC2 cost reduction falls into one of five categories, and the order matters. Applying them in the wrong sequence wastes effort and leaves savings on the table.

The Optimization Sequence

Right-size first, then migrate to Graviton, then schedule non-production, then commit, then layer in Spot. Each step changes the baseline that the next step operates on. Committing to oversized instances locks in waste for one to three years. Scheduling a fleet before right-sizing it means you schedule instances that are too large. The sequence protects against compounding bad decisions.

```mermaid
flowchart LR
    A[Right-Size Instances] --> B[Migrate to Graviton]
    B --> C[Schedule Non-Production]
    C --> D[Commit via Savings Plans]
    D --> E[Layer in Spot Instances]
```

EC2 cost optimization sequence

Savings by Strategy

| Strategy | Typical Savings | Risk Level | Effort | Applies To |
|---|---|---|---|---|
| Right-sizing | 20-40% | Low | Medium | All instances |
| Graviton migration | 20-40% | Low | Medium | Workloads with ARM support |
| Scheduling | 65-70% | None | Low | Dev, test, staging environments |
| Savings Plans | 30-72% | Medium (commitment) | Low | Steady-state production |
| Spot Instances | Up to 90% | High (interruption) | High | Stateless, fault-tolerant workloads |

The percentages compound. Right-sizing a fleet from m7i.2xlarge to m7i.xlarge saves ~50%. Migrating that right-sized fleet to m7g.xlarge saves another ~20%. Scheduling non-production instances to run 50 hours/week instead of 168 saves another ~70% on those instances. Committing to the remaining steady-state production saves 30-66%. Applied in sequence across a typical mixed fleet, the total reduction from the original On-Demand bill often exceeds 60%.
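The compounding described above can be sketched as arithmetic on a single non-production instance's rate. This is an illustration, not a billing calculation: the hourly prices are the approximate us-east-1 figures quoted in this article, and the instance sizes are the hypothetical starting point from the paragraph above.

```python
# Illustrative compounding of the levers on one non-production instance.
# Prices are ~us-east-1 On-Demand rates quoted in this article (assumptions).
M7I_2XLARGE = 0.4032   # oversized starting point, $/hr
M7I_XLARGE = 0.2016    # after right-sizing one step down (~50% off)
M7G_XLARGE = 0.1632    # after migrating the right-sized host to Graviton (~20% off)

HOURS_PER_WEEK = 168
SCHEDULED_HOURS = 50   # business-hours schedule for non-production

baseline = M7I_2XLARGE * HOURS_PER_WEEK    # weekly cost, always-on, oversized
scheduled = M7G_XLARGE * SCHEDULED_HOURS   # weekly cost after all three levers

total_reduction = 1 - scheduled / baseline
print(f"${baseline:.2f}/wk -> ${scheduled:.2f}/wk ({total_reduction:.0%} reduction)")
```

For this single oversized dev instance the levers stack to roughly 88% before any commitment discount; a mixed fleet lands lower because production cannot be scheduled off, which is where the article's 60%+ figure comes from.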

Commitment Discounts: Reserved Instances vs. Savings Plans

AWS offers two commitment mechanisms: Reserved Instances (RIs) and Savings Plans. Both trade flexibility for discount, and the right choice depends on how predictable your fleet composition is.

Reserved Instances

RIs commit you to a specific instance type, operating system, tenancy, and region for one or three years. In return, AWS discounts the hourly rate by up to 72% for Standard RIs and up to 66% for Convertible RIs. Three payment options exist:

| Payment Option | Discount Depth | Cash Flow Impact |
|---|---|---|
| All Upfront | Highest (~72% for 3-year Standard) | Large upfront payment, no monthly charges |
| Partial Upfront | Medium (~65% for 3-year Standard) | Reduced upfront, reduced monthly |
| No Upfront | Lowest (~40% for 1-year Standard) | No upfront, discounted monthly rate |

Standard RIs lock you into the exact instance type. If you buy m5.xlarge RIs and later right-size to m5.large, the RI still applies (within the same family, size flexibility normalizes the discount). But if you switch families (m5 to m7i), the Standard RI becomes wasted spend. Convertible RIs let you exchange for different instance types, but the discount is shallower.
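Size flexibility works through AWS's published normalization factors (large = 4, xlarge = 8, 2xlarge = 16, and so on), which apply to regional Linux RIs with default tenancy. A minimal sketch of the coverage math:

```python
# Regional Linux RI size flexibility via AWS normalization factors.
# Factors below are the documented values for these sizes.
NORMALIZATION = {"large": 4, "xlarge": 8, "2xlarge": 16, "4xlarge": 32}

def ri_covers(ri_size: str, ri_count: int, running_size: str) -> float:
    """How many running instances of `running_size` the RIs fully cover
    within the same family."""
    units = NORMALIZATION[ri_size] * ri_count
    return units / NORMALIZATION[running_size]

print(ri_covers("xlarge", 1, "large"))    # one m5.xlarge RI covers 2.0 m5.large
print(ri_covers("xlarge", 1, "2xlarge"))  # or half of one m5.2xlarge
```

This is why right-sizing within a family does not strand a Standard RI, while switching families does.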

Savings Plans

Savings Plans commit you to a dollar-per-hour spend rate rather than a specific instance. Three types:

| Plan Type | Flexibility | Max Discount | Scope |
|---|---|---|---|
| Compute Savings Plan | Highest | Up to 66% | EC2, Fargate, Lambda; any region, family, size, OS |
| EC2 Instance Savings Plan | Medium | Up to 72% | EC2 only; specific family and region, any size and OS |
| SageMaker Savings Plan | ML-specific | Up to 64% | SageMaker instance usage only |

Compute Savings Plans are the default recommendation for most teams. The discount is slightly shallower than EC2 Instance Plans (66% vs. 72% at the 3-year All Upfront tier), but the flexibility to change instance families, regions, and even compute services (moving workloads from EC2 to Fargate) without losing the discount makes that trade worthwhile.

The Commitment Decision Framework

Note
Never commit before right-sizing. A 3-year Savings Plan on oversized instances locks in waste at a discounted rate. Right-size first, observe the new baseline for 2-4 weeks, then commit based on actual steady-state usage.

My standard approach for commitment purchases:

  1. Cover 70-80% of steady-state with Compute Savings Plans. This accounts for organic growth and workload shifts without overcommitting.
  2. Use 1-year terms unless you have high confidence in a 3-year forecast. The incremental savings from 3-year terms (roughly 15-20% more than 1-year) rarely justify the inflexibility for teams whose architectures evolve.
  3. Layer EC2 Instance Plans on top for workloads that genuinely will not change family or region for the term (database hosts, fixed infrastructure).
  4. Re-evaluate quarterly. Usage patterns drift. A commitment that covered 80% of your fleet in January may cover 60% by June if you have launched new services or migrated workloads.
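One wrinkle worth making explicit when sizing step 1: a Savings Plan commitment is expressed in discounted dollars per hour, so covering a target share of On-Demand spend requires converting at the plan's rate. A back-of-envelope sketch, with the run rate and discount as assumed inputs:

```python
# Sizing a Compute Savings Plan commitment per the 70-80% coverage rule.
# Note: the commitment is stated in DISCOUNTED $/hr, so the On-Demand
# spend you want covered must be converted at the plan's discount.

def sp_commitment(on_demand_hourly: float, coverage: float, discount: float) -> float:
    """Discounted $/hr to commit, to cover `coverage` of steady-state spend."""
    covered = on_demand_hourly * coverage
    return covered * (1 - discount)

# Assumptions: $10/hr steady-state On-Demand run rate, 75% coverage
# target, ~35% discount (1-year Compute SP, No Upfront).
commitment = sp_commitment(10.0, 0.75, 0.35)
print(f"commit to ${commitment:.3f}/hr")  # $4.875/hr
```

Committing $7.50/hr instead (forgetting the conversion) would overshoot coverage by half again and leave commitment unused during troughs.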

Cost Math Example

A fleet of 20 m7i.xlarge instances running 24/7 in us-east-1:

| Pricing Model | Hourly Rate | Monthly Cost (20 instances) | Annual Cost | Savings vs. On-Demand |
|---|---|---|---|---|
| On-Demand | ~$0.2016 | ~$2,903 | ~$34,841 | Baseline |
| 1-year Compute SP (No Upfront) | ~$0.1310 | ~$1,886 | ~$22,637 | 35% |
| 1-year EC2 Instance SP (All Upfront) | ~$0.1150 | ~$1,656 | ~$19,872 | 43% |
| 3-year Compute SP (All Upfront) | ~$0.0820 | ~$1,181 | ~$14,170 | 59% |
| 3-year EC2 Instance SP (All Upfront) | ~$0.0580 | ~$835 | ~$10,022 | 71% |

The spread between 1-year and 3-year is roughly $10,000/year for this 20-instance fleet. That is real money, but only if you are confident those 20 instances will still be m7i.xlarge in the same region three years from now.
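The spread figure falls straight out of the table, using the same 720-hour month convention the table uses:

```python
# Annual cost for the 20-instance fleet at each rate, 720-hour months.
HOURS_PER_YEAR = 720 * 12
FLEET = 20

def annual(rate: float) -> float:
    return rate * HOURS_PER_YEAR * FLEET

# 1-year vs. 3-year EC2 Instance SP (All Upfront) rates from the table:
spread = annual(0.1150) - annual(0.0580)
print(f"3-year spread: ${spread:,.0f}/year")  # ~$9,850/year
```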

Spot Instances: 90% Savings with Production Guardrails

Spot Instances use spare EC2 capacity at discounts up to 90% off On-Demand. AWS can reclaim them with two minutes of notice. That constraint makes them unsuitable for stateful workloads but exceptional for anything that can tolerate interruption.

Where Spot Works

| Workload Type | Spot Fit | Why |
|---|---|---|
| CI/CD build agents | Excellent | Short-lived, stateless, restartable |
| Batch processing (ETL, data pipelines) | Excellent | Checkpointable, parallelizable |
| Container tasks (ECS, EKS) | Good | Orchestrator handles rescheduling |
| Web/API servers behind ALB | Good with caveats | Need mixed instance groups and connection draining |
| Databases | Never | Stateful, interruption causes data loss |
| Single long-running jobs | Poor | Interruption restarts the entire job |

Interruption Rates and Allocation Strategy

AWS publishes interruption frequency data through the Spot Instance Advisor. Less than 5% of Spot Instances get reclaimed by AWS before the customer terminates them voluntarily. The trick is diversification. Requesting a single instance type in a single AZ concentrates your interruption risk. Spreading across multiple instance types, sizes, and AZs through a Spot Fleet or EC2 Auto Scaling group with mixed instance policies drops the effective interruption rate dramatically.

Use the price-capacity-optimized allocation strategy. This selects instances from the pools with the most available capacity at the lowest price. Older strategies (lowest-price, capacity-optimized) optimize for one dimension. The combined strategy handles both.

Interruption Handling

The two-minute warning arrives as an EC2 metadata event and an EventBridge notification. Use both:

  • Metadata polling: A daemon on the instance checks http://169.254.169.254/latest/meta-data/spot/instance-action every 5 seconds. When the response changes from 404 to a JSON payload, the instance has two minutes.
  • EventBridge rule: Triggers a Lambda that deregisters the instance from the target group, drains connections, and pushes state to S3 or DynamoDB.
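The polling side of the handling above can be sketched as a small check with the HTTP fetch injected, so the logic is visible without a live instance. On a real instance the fetcher would GET the IMDS URL below (with an IMDSv2 token); here it is stubbed for illustration.

```python
# Sketch of the spot/instance-action metadata check. The `fetch`
# callable stands in for an HTTP GET against IMDS and returns
# (status_code, body); real code would use urllib/requests with an
# IMDSv2 session token.
import json
from typing import Callable, Optional

IMDS_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

def pending_interruption(fetch: Callable[[str], tuple]) -> Optional[dict]:
    status, body = fetch(IMDS_URL)
    if status == 404:          # no interruption scheduled
        return None
    return json.loads(body)    # e.g. {"action": "terminate", "time": "..."}

# Stubbed fetchers for the two states the daemon sees:
quiet = lambda url: (404, "")
warned = lambda url: (200, '{"action": "terminate", "time": "2025-01-01T00:00:00Z"}')

print(pending_interruption(quiet))             # None: keep working
print(pending_interruption(warned)["action"])  # terminate: begin draining
```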

I have run Spot fleets for CI/CD pipelines that executed 50,000+ builds per month. Total interruption-caused failures over six months: 12. Every one recovered automatically because the build system (Jenkins, in that case) retried on a fresh instance. The cost was 73% below On-Demand. For batch workloads that checkpoint to S3 every 10 minutes, interruption is a non-event: the job resumes from the last checkpoint on a new instance.

Spot and Auto Scaling

Configure Auto Scaling groups with a mixed instances policy. Set a base capacity of On-Demand instances (enough to handle minimum traffic) and fill the rest with Spot. The capacity-rebalancing feature automatically launches replacement Spot Instances when the rebalance recommendation signal fires, before the actual interruption happens. This keeps your fleet stable through capacity shifts.
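As a sketch, the configuration described above maps onto the `create_auto_scaling_group` parameters like this. The group name, launch template name, and instance types are placeholders, not a recommendation:

```python
# Mixed-instances ASG: On-Demand base, Spot fill, diversified types,
# price-capacity-optimized allocation, capacity rebalancing enabled.
# Names and sizes below are hypothetical.
mixed_instances_asg = {
    "AutoScalingGroupName": "web-fleet",       # hypothetical
    "MinSize": 4,
    "MaxSize": 20,
    "CapacityRebalance": True,                 # replace Spot on rebalance signal
    "MixedInstancesPolicy": {
        "LaunchTemplate": {
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": "web-template",  # hypothetical
                "Version": "$Latest",
            },
            # Diversify across families/sizes to dilute interruption risk
            "Overrides": [
                {"InstanceType": t}
                for t in ("m7g.large", "m6g.large", "c7g.large", "c7g.xlarge")
            ],
        },
        "InstancesDistribution": {
            "OnDemandBaseCapacity": 4,                 # floor for minimum traffic
            "OnDemandPercentageAboveBaseCapacity": 0,  # everything above base is Spot
            "SpotAllocationStrategy": "price-capacity-optimized",
        },
    },
}

# With boto3 this would be passed as:
#   boto3.client("autoscaling").create_auto_scaling_group(**mixed_instances_asg)
```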

Right-Sizing with Compute Optimizer

Right-sizing is the most impactful optimization most teams skip. Over-provisioned instances are everywhere. Teams launch an m5.2xlarge "to be safe" during development and never revisit it. That instance runs at 8% average CPU for two years.

How Compute Optimizer Works

AWS Compute Optimizer analyzes 14 days of CloudWatch metrics (CPU utilization, memory utilization via CloudWatch Agent, network I/O, disk I/O) and cross-references them against the performance characteristics of 140+ instance types. It produces three recommendation tiers:

| Recommendation | Meaning | Typical Action |
|---|---|---|
| Over-provisioned | Instance resources exceed workload needs | Downsize to recommended type |
| Under-provisioned | Workload exceeds instance resources | Upsize or change family |
| Optimized | Instance matches workload | No action needed |

In my experience, 40-60% of instances in a typical AWS account are over-provisioned. The median right-sizing recommendation saves ~35% per instance. For a 100-instance fleet averaging $200/month per instance, that translates to $84,000/year in savings from right-sizing alone.

The Right-Sizing Process

  1. Enable Compute Optimizer across all accounts in your AWS Organization. Free for basic recommendations. Enhanced recommendations (3 months of metric history, memory metrics) cost $0.0003360 per resource per hour.
  2. Install the CloudWatch Agent for memory metrics. CPU alone is insufficient. An instance running at 15% CPU but 85% memory is correctly sized. Without memory data, Compute Optimizer will recommend downsizing it, and performance will collapse.
  3. Start with the largest instances. Sort by monthly spend and right-size from the top. The first five instances often account for more savings than the next fifty.
  4. Test in staging first. Resize the staging instance, run load tests, observe for a week, then resize production.
  5. Automate ongoing monitoring. Right-sizing is continuous. Workload patterns change. Set up monthly Compute Optimizer reports delivered to your FinOps team.
Note
Always install the CloudWatch Agent before trusting Compute Optimizer recommendations. Without memory metrics, recommendations are based on CPU and network alone. I have seen teams downsize database hosts based on low CPU utilization, only to discover the workload was memory-bound. The smaller instance started swapping to disk, and query latency went from 5ms to 500ms.
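Step 3's triage amounts to a filter-and-sort over the recommendation export. A minimal sketch with fabricated records; real data would come from Compute Optimizer's export or its `GetEC2InstanceRecommendations` API:

```python
# Triage Compute Optimizer findings: keep over-provisioned instances,
# sort by monthly spend so the largest are handled first.
# Records below are fabricated for illustration.
recs = [
    {"id": "i-aaa", "finding": "OVER_PROVISIONED", "monthly_spend": 1160, "move_to": "m7i.xlarge"},
    {"id": "i-bbb", "finding": "OPTIMIZED",        "monthly_spend": 290,  "move_to": None},
    {"id": "i-ccc", "finding": "OVER_PROVISIONED", "monthly_spend": 580,  "move_to": "r7g.large"},
]

worklist = sorted(
    (r for r in recs if r["finding"] == "OVER_PROVISIONED"),
    key=lambda r: r["monthly_spend"],
    reverse=True,
)
print([r["id"] for r in worklist])  # ['i-aaa', 'i-ccc']
```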

Graviton Migration: The Easiest 20% You Will Ever Save

AWS Graviton processors (ARM64 architecture) deliver 20% lower price per hour than equivalent Intel/AMD instances, with equal or better performance for most workloads. Graviton4 (the latest generation, available on R8g and M8g instances) provides up to 40% better price-performance than comparable x86 instances.

The Price Difference

For the same vCPU count and memory, Graviton instances consistently cost 20% less:

| Instance Pair | vCPUs | Memory | Intel/AMD Hourly | Graviton Hourly | Savings |
|---|---|---|---|---|---|
| m7i.xlarge vs. m7g.xlarge | 4 | 16 GB | ~$0.2016 | ~$0.1632 | 19% |
| c7i.xlarge vs. c7g.xlarge | 4 | 8 GB | ~$0.1785 | ~$0.1452 | 19% |
| r7i.xlarge vs. r7g.xlarge | 4 | 32 GB | ~$0.2646 | ~$0.2138 | 19% |
| m7i.4xlarge vs. m7g.4xlarge | 16 | 64 GB | ~$0.8064 | ~$0.6528 | 19% |

The 20% price reduction is the floor. Graviton instances also deliver 20-25% better performance per vCPU for most workloads, which means you can often drop an instance size after migration. A workload running on m7i.2xlarge may run equally well on m7g.xlarge, combining the 20% price reduction with a 50% size reduction for a total savings near 60%.
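The compound figure is just the ratio of the two rates, using the approximate us-east-1 prices from the table above (m7i.2xlarge at twice the m7i.xlarge rate):

```python
# m7i.2xlarge -> m7g.xlarge: ~20% Graviton discount compounded with a
# one-step size drop. Rates are the ~us-east-1 figures used above.
m7i_2xlarge = 0.4032
m7g_xlarge = 0.1632

saving = 1 - m7g_xlarge / m7i_2xlarge
print(f"{saving:.0%}")  # ~60%
```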

What Migrates Easily

Most modern software stacks run on Graviton without modification:

  • Containerized workloads (Docker multi-arch builds): rebuild with --platform linux/arm64 and deploy. If your images already support multi-arch, change the instance type and you are done.
  • Interpreted languages (Python, Node.js, Ruby, Java): the runtime handles the architecture. Change the instance type. Performance improves.
  • Compiled Go and Rust: cross-compile with GOARCH=arm64 or --target aarch64-unknown-linux-gnu. Typically a one-line CI change.
  • Managed services: RDS, ElastiCache, OpenSearch, and EKS all offer Graviton instance types. Switch the instance class in the configuration.

What Requires Work

  • Native C/C++ extensions compiled for x86: need recompilation. Most open-source libraries have ARM packages in modern distros.
  • Legacy commercial software with x86-only binaries: no migration path without vendor support.
  • Workloads using x86-specific intrinsics (SSE, AVX): rare outside of specialized numerical computing.

Migration Strategy

I migrate in this order: non-production first, then stateless services, then stateful services, then databases.

```mermaid
flowchart TD
    A[Identify Candidates] --> B[Dev/Test Environments]
    B --> C[CI/CD Build Agents]
    C --> D[Stateless Web and API Servers]
    D --> E[Worker Fleets and Batch Jobs]
    E --> F["Caches (ElastiCache/Redis)"]
    F --> G["Databases (RDS/Aurora)"]
```

Graviton migration sequence by risk level

For each service, the process is: deploy to a Graviton instance in staging, run the full test suite, observe for a week under production-like load, then switch production. I have migrated fleets of 200+ instances to Graviton across three organizations. The failure rate (workloads that could not run on ARM) was under 5%, and every failure was a native binary dependency with no ARM build available at the time.

Scheduling: Stop Paying for Idle Dev/Test

Development, test, staging, QA, and demo environments typically run 24/7 despite being used only during business hours. A 50-hour work week is 30% of the 168 hours in a calendar week. Scheduling these environments to stop outside business hours saves 70%.

AWS Instance Scheduler

The AWS Instance Scheduler solution deploys via CloudFormation and uses resource tags to control EC2 and RDS instance start/stop schedules. Define a schedule (weekdays 8 AM to 6 PM EST), tag your instances (Schedule=business-hours), and the scheduler stops them every evening and starts them every morning.

The Math

| Scenario | Hours/Week | Monthly Cost (10x m7g.xlarge) | Annual Cost | Savings vs. Always-On |
|---|---|---|---|---|
| Always-on (24/7) | 168 | ~$1,193 | ~$14,312 | Baseline |
| Business hours (10x5) | 50 | ~$355 | ~$4,260 | 70% |
| Extended hours (12x5) | 60 | ~$426 | ~$5,112 | 64% |
| Business + Saturday (10x6) | 60 | ~$426 | ~$5,112 | 64% |

For a team running three environments (dev, test, staging) with 10 instances each at m7g.xlarge, switching from always-on to business-hours scheduling saves ~$30,000/year. That is one configuration change and a set of tags.
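The ~$30,000 figure reproduces directly from the rates above (approximate us-east-1 m7g.xlarge pricing, 52-week year):

```python
# Three environments x ten m7g.xlarge, 24/7 -> 50-hour business week.
RATE = 0.1632            # ~m7g.xlarge On-Demand $/hr, us-east-1
INSTANCES = 3 * 10
ALWAYS_ON = 168 * 52     # hours/year, always-on
SCHEDULED = 50 * 52      # hours/year, business-hours schedule

savings = RATE * (ALWAYS_ON - SCHEDULED) * INSTANCES
fraction = 1 - SCHEDULED / ALWAYS_ON
print(f"${savings:,.0f}/year saved ({fraction:.0%})")  # ~$30,042/year (70%)
```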

Operational Considerations

The biggest pushback I hear: "But our CI pipeline runs overnight." Fair point. Solutions:

  • Trigger-based start: Use EventBridge to start instances when a CI pipeline begins (CodePipeline state change event) and stop them when the pipeline completes.
  • Separate CI from dev. CI/CD build agents run on Spot Instances (cheap, ephemeral). Dev environments run on scheduled instances. Different lifecycles, different cost profiles.
  • Staggered schedules for global teams. Tag instances with timezone-aware schedules. The US dev environment runs 8 AM to 6 PM EST. The India staging environment runs 9 AM to 7 PM IST.
Note
Always use tags rather than hardcoded instance IDs in your scheduling configuration. Instances get replaced during deployments, and a schedule tied to an instance ID breaks silently. Tag-based scheduling survives instance replacement, Auto Scaling group updates, and infrastructure-as-code redeployments.
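The timezone-aware decision behind staggered schedules can be sketched as a pure function: given a schedule window, an IANA timezone, and the current UTC time, decide whether the instance should be running. The tag format here is invented for illustration; the AWS Instance Scheduler solution defines its own schedule schema.

```python
# Tag-driven, timezone-aware run/stop decision (illustrative schema).
from datetime import datetime, time
from zoneinfo import ZoneInfo

def should_run(schedule: str, tz: str, now_utc: datetime) -> bool:
    """schedule is 'HH:MM-HH:MM' in the instance's local timezone;
    weekdays only, matching the business-hours examples above."""
    start_s, end_s = schedule.split("-")
    start, end = time.fromisoformat(start_s), time.fromisoformat(end_s)
    local = now_utc.astimezone(ZoneInfo(tz))
    return local.weekday() < 5 and start <= local.time() < end

# 15:00 UTC on a Wednesday is 10:00 in New York but 20:30 in Kolkata:
moment = datetime(2025, 1, 8, 15, 0, tzinfo=ZoneInfo("UTC"))
print(should_run("08:00-18:00", "America/New_York", moment))  # True
print(should_run("09:00-19:00", "Asia/Kolkata", moment))      # False
```

The same instant yields different decisions per environment, which is exactly the staggered-schedule behavior described above.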

Stacking Strategies: The Optimization Order

The five strategies interact. Applying them in the right order maximizes the compound effect. Here is the full sequence with cumulative impact on a hypothetical 50-instance production fleet running m7i.xlarge at $0.2016/hour On-Demand:

| Step | Action | Per-Instance Rate | Fleet Monthly Cost | Cumulative Savings |
|---|---|---|---|---|
| 0 | Baseline (On-Demand) | $0.2016/hr | $7,258 | 0% |
| 1 | Right-size (50% over-provisioned → m7i.large) | $0.1008/hr | $3,629 | 50% |
| 2 | Migrate to Graviton (m7i.large → m7g.large) | $0.0816/hr | $2,938 | 60% |
| 3 | Commit (1-year Compute SP, ~35% off) | $0.0530/hr | $1,908 | 74% |
| 4 | Move 20% of fleet to Spot (~70% off for those) | Blended ~$0.0470/hr | $1,692 | 77% |

Starting monthly bill: $7,258. Final monthly bill: $1,692. That is a 77% reduction, and the workload runs on better hardware (Graviton outperforms Intel per-vCPU) with no degradation in capacity.

The sequence matters because each step reduces the baseline for the next. Right-sizing before committing means the Savings Plan covers the correct instance size. Migrating to Graviton before committing means the Savings Plan rate reflects the lower Graviton price. Scheduling non-production before committing means you do not overcommit for instances that only run 50 hours/week.

Key Takeaways

  1. Right-size before everything else. Committing to oversized instances locks in waste for one to three years. Use Compute Optimizer with the CloudWatch Agent installed for memory metrics. Start with the largest instances by spend.
  2. Graviton is the lowest-effort optimization. A 20% price reduction with equal or better performance. Most modern workloads migrate without code changes. Start with non-production and work up.
  3. Schedule every non-production environment. Stopping dev/test/staging during non-business hours saves 65-70%. The AWS Instance Scheduler solution handles this with tags and a CloudFormation deployment.
  4. Use Compute Savings Plans as your default commitment. The flexibility to change instance families, regions, and compute services (EC2 to Fargate) outweighs the 6% deeper discount from EC2 Instance Plans. Cover 70-80% of steady-state usage. Prefer 1-year terms unless you have exceptional forecast confidence.
  5. Layer Spot Instances on fault-tolerant workloads. CI/CD, batch processing, and containerized stateless services benefit from 60-90% discounts. Use price-capacity-optimized allocation, diversify across instance types, and handle the two-minute interruption signal.
  6. Apply the strategies in sequence. Right-size, Graviton, schedule, commit, Spot. Each step lowers the baseline for the next. The compound effect routinely exceeds 60% total reduction.
  7. Re-evaluate quarterly. Fleet composition, workload patterns, and AWS pricing all change. A Savings Plan that covered 80% of usage in January may cover 55% by September. Compute Optimizer recommendations shift as workloads evolve.


Let's Build Something!

I help teams ship cloud infrastructure that actually works at scale. Whether you're modernizing a legacy platform, designing a multi-region architecture from scratch, or figuring out how AI fits into your engineering workflow, I've seen your problem before. Let me help.

Currently taking on select consulting engagements through Vantalect.