About the author: I'm Charles Sieg, a cloud architect and platform engineer who builds apps, services, and infrastructure for Fortune 1000 clients through Vantalect. If your organization is rethinking its software strategy in the age of AI-assisted engineering, let's talk.
Infrastructure as Code is one of those concepts that every cloud team claims to practice, yet the architectural differences between the tools they use (and the downstream implications for team velocity, operational safety, and organizational scaling) are rarely examined with the rigor they deserve. I have provisioned and managed infrastructure across hundreds of AWS accounts using all four major IaC tools over the past decade, from wrestling with early CloudFormation YAML to adopting CDK for its high-level abstractions to running Terraform at scale across multi-account landing zones. That experience has given me strong opinions about when each tool shines and where each one will hurt you in production.
This article is an architecture comparison for engineers and architects who need to understand the fundamental tradeoffs between CloudFormation, CDK, Terraform, and Pulumi: the state management models, deployment mechanics, failure modes, and team-scaling characteristics that determine whether your IaC practice accelerates your organization or becomes the bottleneck everyone works around.
What Infrastructure as Code Actually Is
Before comparing tools, it is worth grounding what IaC actually is. Infrastructure as Code is the practice of defining infrastructure resources (compute, networking, storage, IAM policies, DNS records, everything) in machine-readable definition files that are version-controlled, reviewed, tested, and applied through automated pipelines.
The key architectural concepts that differentiate IaC tools:
| Concept | Description | Why It Matters |
|---|---|---|
| Declarative vs. Imperative | Declarative says "I want this end state"; imperative says "execute these steps" | Declarative tools handle ordering and dependency resolution; imperative gives you more control but more rope |
| State Management | How the tool tracks what resources it has created and their current configuration | State is the single most operationally critical aspect of any IaC tool; state corruption or loss can orphan hundreds of resources |
| Plan/Preview | The ability to see what changes will be made before applying them | Essential for production safety; the quality of the plan output varies dramatically between tools |
| Drift Detection | Identifying when actual infrastructure diverges from the declared configuration | Manual changes happen; how the tool handles this reality determines operational hygiene |
| Resource Graph | How the tool models dependencies between resources | Determines parallelism during deployment and correctness of create/update/delete ordering |
| Modularity | How you package and reuse infrastructure patterns | Directly impacts team velocity and consistency as the organization scales |
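To make the resource-graph concept concrete, here is a toy sketch (purely illustrative, not any tool's actual internals) of how an IaC engine might order operations: a topological sort over declared dependencies, grouped into "waves" that reveal which resources can safely be created in parallel.

```python
def deployment_waves(deps):
    """Group resources into waves: everything in a wave has all of its
    dependencies satisfied by earlier waves, so each wave can run in parallel.

    deps maps resource name -> set of resource names it depends on.
    """
    remaining = {r: set(d) for r, d in deps.items()}
    waves = []
    while remaining:
        # Resources with no unsatisfied dependencies are ready now.
        ready = sorted(r for r, d in remaining.items() if not d)
        if not ready:
            raise ValueError("dependency cycle detected")
        waves.append(ready)
        for r in ready:
            del remaining[r]
        for d in remaining.values():
            d.difference_update(ready)
    return waves

# A toy VPC: subnets depend on the VPC, the NAT gateway on a subnet, etc.
deps = {
    "vpc": set(),
    "subnet_a": {"vpc"},
    "subnet_b": {"vpc"},
    "nat_gw": {"subnet_a"},
    "route_table": {"vpc"},
    "route": {"route_table", "nat_gw"},
}
print(deployment_waves(deps))
# [['vpc'], ['route_table', 'subnet_a', 'subnet_b'], ['nat_gw'], ['route']]
```

The wave structure is why tools that parallelize aggressively (Terraform, Pulumi) finish faster than ones that serialize conservatively: the second wave here is three independent API calls.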
The lifecycle of an IaC-managed resource follows a consistent pattern regardless of tool:
```mermaid
flowchart TD
    A[Define Resource in Code] --> B[Plan / Preview Changes]
    B --> C{Review Acceptable?}
    C -->|No| A
    C -->|Yes| D[Apply Changes to Cloud Provider]
    D --> E[Update State Record]
    E --> F[Drift Detection Monitoring]
    F -->|Drift Detected| G[Investigate and Reconcile]
    G --> A
    F -->|No Drift| H[Resource Stable in Production]
    H -->|Change Needed| A
```

The tool you choose determines the ergonomics and reliability of every step in this lifecycle. A bad plan output means you miss destructive changes during review. Poor state management means a failed apply can leave your infrastructure in an inconsistent state that takes hours to untangle. Weak modularity means every team reinvents the same VPC pattern, each with slightly different security group rules.
The Four Tools at a Glance
Here is a high-level orientation of each tool:
| Dimension | CloudFormation | CDK | Terraform | Pulumi |
|---|---|---|---|---|
| Primary Language | YAML / JSON | TypeScript, Python, Java, C#, Go | HCL (HashiCorp Configuration Language) | TypeScript, Python, Go, C#, Java, YAML |
| Maintained By | AWS | AWS | HashiCorp (IBM) | Pulumi Corporation |
| State Management | Service-managed by AWS | Via CloudFormation | State files (local, S3, Terraform Cloud) | Pulumi Cloud, S3, local file |
| Cloud Support | AWS only | AWS only (via CloudFormation) | Multi-cloud (AWS, Azure, GCP, 3000+ providers) | Multi-cloud (AWS, Azure, GCP, 100+ providers) |
| License | Proprietary (free to use) | Apache 2.0 | BSL 1.1 (formerly MPL 2.0) | Apache 2.0 |
| First Release | 2011 | 2019 | 2014 | 2018 |
| Learning Curve | Low (but verbose) | Medium (need to learn constructs) | Medium (need to learn HCL) | Low if you know a supported language |
| Community Size | Large (AWS-captive) | Growing | Very large | Smaller but growing |
| Deployment Speed | Slow (minutes for simple stacks) | Slow (synthesizes to CloudFormation) | Fast (seconds to minutes) | Fast (seconds to minutes) |
The most important row in this table is state management. It determines your operational burden, your blast radius during failures, and your recovery options when things go wrong. We will return to this repeatedly.
CloudFormation: The Native Foundation
CloudFormation is where many AWS practitioners begin their IaC journey, and for good reason: it is free, requires no additional tooling beyond the AWS CLI or Console, and has native integration with every AWS service. It is also where many practitioners eventually hit a wall.
How CloudFormation Works
CloudFormation uses a declarative model where you define your desired infrastructure state in a YAML or JSON template. You submit this template to the CloudFormation service, which creates a "stack," a collection of AWS resources that are provisioned, updated, and deleted as a unit. CloudFormation's orchestration engine analyzes resource dependencies, determines the correct ordering, and makes API calls to the respective AWS services to create or modify resources.
The key architectural insight is that CloudFormation is a fully service-managed orchestration engine. You never run CloudFormation; AWS runs it for you. Your template is submitted to the CloudFormation service, which handles execution entirely within the AWS control plane. This means:
- No state files to manage, back up, or recover
- No credentials to configure beyond normal AWS API access
- No additional infrastructure to run (no Terraform Cloud, no Pulumi service)
- The service handles concurrency, retries, and rollback automatically
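A minimal template, for orientation (a sketch; the parameter and bucket name are illustrative):

```yaml
AWSTemplateFormatVersion: "2010-09-09"
Description: Minimal example stack

Parameters:
  Environment:
    Type: String
    AllowedValues: [dev, prod]

Resources:
  LogsBucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: !Sub "my-app-logs-${Environment}" # illustrative name
      VersioningConfiguration:
        Status: Enabled

Outputs:
  BucketArn:
    Value: !GetAtt LogsBucket.Arn
```

You submit this to the CloudFormation service (via CLI or Console), and everything after that point happens inside the AWS control plane.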
Strengths
Native AWS integration is CloudFormation's defining advantage. When AWS launches a new service or feature, CloudFormation support is typically available on day one. Resource types map directly to AWS API resources, and the coverage is comprehensive, with over 1,100 resource types spanning every AWS service.
Change sets allow you to preview what a stack update will do before executing it. You create a change set, review the proposed changes (creates, updates, deletes), and then either execute or discard it. This is CloudFormation's equivalent of terraform plan, though it provides less detail about what specifically changed within a resource.
Stack rollback is automatic by default. If any resource in a stack creation or update fails, CloudFormation rolls back all changes to the previous known-good state. This is a genuine safety net, though it can also become a trap when rollbacks themselves fail, creating the dreaded UPDATE_ROLLBACK_FAILED state.
StackSets enable deploying the same template across multiple AWS accounts and regions from a single management account. For organizations using AWS Organizations, this is a powerful mechanism for enforcing guardrails: deploying SCPs, Config rules, and security baselines consistently across hundreds of accounts.
Weaknesses
Verbosity is CloudFormation's most immediate pain point. A moderately complex VPC with public and private subnets, NAT gateways, route tables, and security groups easily reaches 500+ lines of YAML. Compare this to the equivalent CDK code at around 20 lines or a Terraform module at around 50 lines. This verbosity directly impacts reviewability. When a pull request modifies 800 lines of CloudFormation YAML, reviewers struggle to identify what actually changed.
Deployment speed is consistently slow. Even simple stacks with a handful of resources take 2-5 minutes to deploy. Complex stacks with dozens of resources can take 30-60 minutes. CloudFormation serializes many operations that could safely run in parallel, and its internal polling intervals add latency. During a production incident where you need to push an infrastructure change, this speed penalty is painful.
The 500-resource stack limit forces architectural decisions that have nothing to do with your actual infrastructure. When a stack approaches this limit, you must split it into nested stacks or separate stacks with cross-stack references, adding complexity solely to work around a platform constraint.
Limited programming constructs make it difficult to express conditional logic, loops, or dynamic resource generation. CloudFormation's Conditions, Fn::If, and Fn::ForEach are functional but clumsy compared to actual programming language constructs. When you need to create a variable number of resources based on input parameters, the gymnastics required in CloudFormation are significant.
CloudFormation's hard limits at a glance:

| Aspect | Detail |
|---|---|
| Max resources per stack | 500 |
| Max nested stack depth | 5 levels |
| Max template size (S3) | 1 MB |
| Max template size (direct) | 51,200 bytes |
| Max parameters | 200 |
| Max outputs | 200 |
| Max mappings | 200 |
| Deployment speed | 2-60 minutes depending on complexity |
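Returning to the limited-constructs point: even a simple "pick a value per environment" decision, a one-line ternary in any programming language, requires the Conditions machinery. A sketch (values illustrative):

```yaml
Parameters:
  Environment:
    Type: String
    AllowedValues: [dev, prod]

Conditions:
  IsProd: !Equals [!Ref Environment, prod]

Resources:
  AppInstance:
    Type: AWS::EC2::Instance
    Properties:
      # In a real language: env == "prod" ? "m5.large" : "t3.micro"
      InstanceType: !If [IsProd, m5.large, t3.micro]
      ImageId: ami-12345678 # illustrative
```

Multiply this pattern across dozens of properties and the template becomes difficult to reason about.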
When to Use CloudFormation
Use CloudFormation directly when you need maximum AWS-native integration with zero operational overhead for state management, and your templates are relatively simple. It excels for organizational guardrails via StackSets, for teams that cannot introduce additional tooling, and for resources that must be managed through AWS-native mechanisms (such as Service Catalog products). For anything involving complex logic or large-scale infrastructure, strongly consider CDK or Terraform instead.
CDK: CloudFormation with Programming Languages
The AWS Cloud Development Kit is a compiler that targets CloudFormation. You write infrastructure definitions in a real programming language (TypeScript, Python, Java, C#, or Go) and CDK synthesizes those definitions into CloudFormation templates, which are then deployed through the standard CloudFormation service.
How CDK Works
CDK introduces the concept of constructs, reusable cloud components organized in three levels:
| Construct Level | Description | Example |
|---|---|---|
| L1 (Cfn Resources) | Direct 1:1 mapping to CloudFormation resource types | CfnBucket, CfnFunction: every property exposed, no abstractions |
| L2 (Curated) | Higher-level constructs with sensible defaults, convenience methods, and grant patterns | Bucket, Function: encryption enabled by default, .grantRead() method |
| L3 (Patterns) | Opinionated multi-resource patterns that compose L2 constructs | LambdaRestApi: creates API Gateway + Lambda + permissions in one construct |
The construct tree is CDK's core data model. When you instantiate constructs in your code, CDK builds a tree of nodes. Each node in the tree maps to one or more CloudFormation resources. CDK then walks this tree to generate a CloudFormation template (the "cloud assembly"), which is stored in the cdk.out directory.
The synthesis step (cdk synth) is purely local. No AWS API calls are made. The resulting CloudFormation template is what actually gets deployed, meaning that every CDK deployment is ultimately a CloudFormation deployment with all of its characteristics, including its speed limitations and resource limits.
Strengths
Real programming language power is CDK's transformative advantage. You can use loops, conditionals, functions, classes, interfaces, type systems, and package managers, applying the full arsenal of software engineering practices to infrastructure. A VPC that takes 500 lines of CloudFormation YAML becomes:
```typescript
new ec2.Vpc(this, 'MyVpc', {
  maxAzs: 3,
  natGateways: 1,
});
```
Those few lines synthesize to approximately 30 CloudFormation resources (subnets, route tables, NAT gateways, internet gateway, and all the associations between them) with sensible defaults applied throughout.
Type safety catches errors at compile time rather than during deployment. If you misspell a property name or pass the wrong type, your IDE and compiler flag it immediately. Compare this to CloudFormation, where a typo in a YAML property name simply gets ignored and your resource deploys with an unexpected default value.
The grant pattern is a high-level abstraction for IAM permissions. Instead of writing IAM policies by hand (a notorious source of both over-permissioning and broken deployments), you write:
```typescript
bucket.grantRead(lambdaFunction);
```
CDK generates the minimum-privilege IAM policy automatically. This pattern alone eliminates an entire category of security and operational errors.
Aspect-driven development allows you to apply cross-cutting concerns (tags, encryption requirements, compliance rules) across all constructs in a stack without modifying each one individually. This is powerful for organizational standards enforcement.
Weaknesses
CloudFormation under the hood means CDK inherits every CloudFormation limitation. The 500-resource limit, slow deployment speed, verbose change sets, and occasional stack rollback failures all apply. CDK gives you a better authoring experience, but the deployment experience is unchanged.
Construct library gaps exist. While L2 constructs cover the most common services well, newer or less popular AWS services may only have L1 constructs available, which means you are back to writing CloudFormation-level code in a programming language, losing most of CDK's ergonomic benefits.
cdk diff is misleading. The output of cdk diff shows the difference between your synthesized template and the deployed stack, but it does not account for drift. If someone made manual changes to a resource through the console, cdk diff will not show you that your deployment is about to revert those changes. This is a CloudFormation limitation that CDK inherits and makes worse by giving you false confidence that you understand what your deployment will do.
Construct versioning complexity grows with team size. When multiple teams publish and consume shared construct libraries, version conflicts between constructs that depend on different versions of the CDK core library create dependency resolution headaches that feel more like application dependency management than infrastructure management.
When to Use CDK
CDK is the right choice when your team is committed to AWS (no multi-cloud requirements), your engineers are comfortable in TypeScript or Python, and you want the highest-level abstractions available for AWS infrastructure. It is particularly powerful for teams that deploy application code alongside infrastructure. CDK handles Lambda bundling, Docker image building, and S3 asset uploads as part of the synthesis process. The caveat is that you must accept CloudFormation's deployment characteristics, which means CDK is not ideal for environments requiring fast iteration cycles on infrastructure changes.
Terraform: The Multi-Cloud Standard
Terraform has become the de facto standard for infrastructure as code across the industry, and for good reason. Its provider model, state management system, and plan/apply workflow create an IaC experience that scales from a single developer managing a handful of resources to platform teams managing thousands of resources across multiple cloud providers.
How Terraform Works
Terraform uses HashiCorp Configuration Language (HCL), a declarative language purpose-built for infrastructure definitions. HCL sits between YAML's simplicity and a general-purpose programming language's power. It supports variables, expressions, loops (for_each, count), conditionals, functions, and local values, but does not support arbitrary computation.
Terraform's architecture has three key components:
- Core engine: Parses HCL, builds a resource dependency graph, determines the order of operations, and coordinates with providers
- Providers: Plugins that translate Terraform resource definitions into API calls for specific platforms (AWS, Azure, GCP, Kubernetes, Datadog, PagerDuty, and 3,000+ others)
- State: A JSON file that records the mapping between your Terraform configuration and the real-world resources it manages
The plan/apply workflow is Terraform's defining operational pattern:
- terraform plan: Reads state, queries the cloud provider for current resource configurations, compares against your HCL, and produces a detailed execution plan showing exactly what will be created, modified, or destroyed
- Review the plan: A human or automated policy engine evaluates the proposed changes
- terraform apply: Executes the plan, makes API calls to the cloud provider, and updates the state file
This workflow is simple, predictable, and auditable. The plan output is detailed enough to catch most dangerous changes before they execute. You can see that a security group modification will cause a replacement rather than an in-place update, for example.
Strengths
Multi-cloud support is Terraform's most significant architectural advantage. The provider model creates a consistent workflow regardless of the target platform. The same plan → review → apply cycle works for AWS resources, Azure resources, Kubernetes objects, DNS records, monitoring dashboards, and PagerDuty escalation policies. For organizations using multiple cloud providers or managing infrastructure beyond just cloud resources, this consistency is valuable.
The plan output is excellent. Terraform's plan clearly shows each resource that will change, what attributes are changing, whether the change is in-place or requires replacement, and which changes are known before apply versus computed during apply. This level of detail enables meaningful code review of infrastructure changes.
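An abridged plan excerpt might look like the following (resource names illustrative). The ~ marker means an in-place update; -/+ means destroy-and-recreate:

```text
  # aws_instance.app will be updated in-place
  ~ resource "aws_instance" "app" {
      ~ tags = {
          ~ "Environment" = "dev" -> "prod"
        }
    }

  # aws_instance.worker must be replaced
-/+ resource "aws_instance" "worker" {
      ~ ami = "ami-0abc1234" -> "ami-0def5678" # forces replacement
    }

Plan: 1 to add, 1 to change, 1 to destroy.
```

The "forces replacement" annotation on a specific attribute is exactly the detail that makes destructive changes visible in code review.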
State management flexibility allows you to choose the backend that fits your operational model. For production use, S3 with DynamoDB locking is the standard AWS pattern: the state file lives in a versioned S3 bucket with DynamoDB providing distributed locking to prevent concurrent modifications. Terraform Cloud and Terraform Enterprise provide managed state with additional features like remote plan execution and policy enforcement.
Module ecosystem is mature and extensive. The Terraform Registry hosts thousands of modules covering common patterns such as VPCs, EKS clusters, RDS instances, and more. The module system supports versioning, input validation, and output values, making it straightforward to publish and consume reusable infrastructure patterns.
Deployment speed is noticeably faster than CloudFormation for most operations. Terraform makes API calls directly to cloud providers without an intermediary orchestration service, and it parallelizes resource creation aggressively based on the dependency graph. A stack that takes 15 minutes in CloudFormation might complete in 3-5 minutes with Terraform.
Weaknesses
State file management is Terraform's most significant operational burden. The state file is the source of truth for what Terraform manages. If the state file is lost, Terraform loses track of all managed resources; they continue to exist in the cloud but Terraform can no longer manage them. If the state file becomes corrupted (due to a failed apply, a concurrent modification, or manual editing), recovery can require careful surgery with terraform state commands.
State file management best practices:
| Practice | Implementation | Why |
|---|---|---|
| Remote backend | S3 + DynamoDB | Never store state locally in production |
| State locking | DynamoDB table | Prevents concurrent modifications |
| Versioning | S3 bucket versioning | Enables rollback if state is corrupted |
| Encryption | SSE-S3 or SSE-KMS | State files contain sensitive data (ARNs, IDs, sometimes secrets) |
| Access control | Least-privilege IAM | Only CI/CD pipelines should write state |
| State file per environment | Separate backends or workspaces | Limits blast radius |
HCL is neither here nor there. It is more powerful than YAML but less powerful than a programming language. For simple infrastructure, HCL is pleasant. For complex infrastructure with heavy conditional logic, dynamic resource generation, or sophisticated data transformation, HCL's limitations become frustrating. The for_each meta-argument and dynamic blocks help, but they are syntactically awkward compared to equivalent code in Python or TypeScript.
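To show both the power and the awkwardness, here is a sketch of a dynamic block generating one ingress rule per port (names and CIDR are illustrative):

```hcl
variable "ingress_ports" {
  type    = list(number)
  default = [80, 443]
}

resource "aws_security_group" "app" {
  name = "app-sg" # illustrative

  # One ingress block per port: a plain loop in any general-purpose
  # language, a "dynamic" block with its own iterator variable in HCL.
  dynamic "ingress" {
    for_each = var.ingress_ports
    content {
      from_port   = ingress.value
      to_port     = ingress.value
      protocol    = "tcp"
      cidr_blocks = ["10.0.0.0/8"]
    }
  }
}
```

It works, and it is reviewable, but the indirection (dynamic, for_each, content, the implicit ingress iterator) is the kind of syntax-within-syntax that a real programming language avoids.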
Provider version drift is a persistent operational challenge. Terraform providers are versioned independently of Terraform core, and provider updates can introduce breaking changes. Pinning provider versions is essential, but it creates a maintenance burden; you must periodically update provider versions and validate that nothing breaks.
The BSL license change in August 2023 is worth acknowledging. HashiCorp relicensed Terraform from MPL 2.0 to BSL 1.1, which restricts competitive use. For most end-user organizations, this changes nothing; you can still use Terraform freely for managing your own infrastructure. But it sparked the creation of OpenTofu, an MPL-licensed fork maintained by the Linux Foundation, and it has introduced uncertainty about Terraform's long-term community trajectory.
When to Use Terraform
Terraform is the right choice when you need multi-cloud support, want a battle-tested tool with a massive community and module ecosystem, or need fast deployment cycles. It is particularly strong for platform teams building self-service infrastructure for development teams, for organizations managing infrastructure beyond just AWS (DNS, monitoring, CI/CD, SaaS tools), and for teams that want the most mature IaC ecosystem available. The operational cost of state management is real but well-understood and manageable with proper practices.
Pulumi: IaC as Real Code
Pulumi takes the idea that started with CDK (using real programming languages for infrastructure) and applies it across clouds without an intermediary compilation step. Where CDK synthesizes to CloudFormation and then deploys, Pulumi's engine directly manages resource provisioning using the cloud provider APIs.
How Pulumi Works
Pulumi programs are written in general-purpose programming languages: TypeScript, Python, Go, C#, Java, or YAML. When you run pulumi up, the Pulumi engine executes your program, constructs a resource dependency graph from the resources your code declares, compares this desired state against the current state (stored in a backend), and computes a plan showing what needs to change. After review, the engine executes the plan by making API calls through Pulumi providers.
The critical difference from CDK is that Pulumi's engine talks directly to cloud APIs; there is no CloudFormation intermediate step. This means:
- No 500-resource stack limit
- Faster deployments (direct API calls, aggressive parallelization)
- Better error messages (errors come from the cloud API, not from CloudFormation's abstraction layer)
- No dependency on CloudFormation's rollback mechanics (which can themselves fail)
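The plan computation that both Pulumi's and Terraform's engines perform can be illustrated with a toy diff engine (purely illustrative; this is not either tool's actual data model):

```python
def compute_plan(desired, current):
    """Toy plan: compare desired resource properties against current state.

    desired/current map resource name -> dict of properties.
    Returns actions: create, update (with old/new per changed key), delete.
    """
    plan = {"create": [], "update": {}, "delete": []}
    for name, props in desired.items():
        if name not in current:
            plan["create"].append(name)
        elif props != current[name]:
            changed = {k: (current[name].get(k), v)
                       for k, v in props.items() if current[name].get(k) != v}
            plan["update"][name] = changed
    for name in current:
        if name not in desired:
            plan["delete"].append(name)
    return plan

desired = {"bucket": {"versioning": True}, "queue": {"retention": 60}}
current = {"bucket": {"versioning": False}, "topic": {"fifo": False}}
print(compute_plan(desired, current))
# {'create': ['queue'], 'update': {'bucket': {'versioning': (False, True)}}, 'delete': ['topic']}
```

The real engines add dependency ordering, replacement detection, and provider round-trips on top, but this desired-versus-current comparison is the conceptual core of every preview.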
Strengths
Real programming languages without CloudFormation's limitations. Pulumi gives you the same language advantages as CDK (loops, conditionals, type safety, package management) without inheriting CloudFormation's operational constraints. You get the best of both worlds: high-level programming with direct cloud API execution.
Testing with standard frameworks is where Pulumi pulls ahead of every other tool. Because your infrastructure is real code, you can write unit tests with pytest, Jest, or Go's testing package. You can mock cloud resources, test conditional logic, validate that security policies are correctly applied, and run these tests in your CI pipeline before any infrastructure is provisioned.
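Real Pulumi tests use the SDK's mocking support, but the underlying principle can be shown without any SDK at all: infrastructure-building logic is plain code, so plain assertions apply. A hedged sketch (the function and the encryption policy are invented for illustration):

```python
def bucket_args(env: str) -> dict:
    """Build S3 bucket arguments from an environment name (illustrative logic)."""
    return {
        "versioning": True,
        # Hypothetical policy: production buckets must use KMS encryption.
        "encryption": "aws:kms" if env == "prod" else "AES256",
        # Hypothetical policy: never allow force-destroy in production.
        "force_destroy": env != "prod",
    }

# Ordinary unit tests, runnable in CI before anything is provisioned.
assert bucket_args("prod")["encryption"] == "aws:kms"
assert bucket_args("prod")["force_destroy"] is False
assert bucket_args("dev")["encryption"] == "AES256"
```

Because the conditional logic lives in a testable function rather than template interpolation, a policy regression fails a unit test in seconds instead of failing an apply in minutes.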
Multi-cloud with real languages combines Terraform's multi-cloud breadth with CDK's programming language power. You can define AWS, Azure, and Kubernetes resources in the same Python or TypeScript program with full IDE support and type safety.
Pulumi AI and import capabilities allow you to describe infrastructure in natural language and generate Pulumi code, and to import existing cloud resources into Pulumi management. The import experience is smoother than Terraform's, generating the code needed to manage the imported resource going forward.
Stack references provide type-safe cross-stack references. When one Pulumi stack needs to reference outputs from another (a networking stack exporting VPC IDs for consumption by application stacks), the references are type-checked at compile time rather than being string-based lookups that fail at deploy time.
Weaknesses
Smaller community and fewer examples. When you search for "how to configure X with Terraform," you find dozens of blog posts, Stack Overflow answers, and module examples. The equivalent search for Pulumi yields far fewer results. This is improving, but the ecosystem gap is real and impacts productivity, especially when troubleshooting edge cases.
Pulumi Cloud dependency for full features. While Pulumi's engine is open source and you can use S3 or local file backends for state, features like secrets management, deployment history, RBAC, and drift detection are only available through Pulumi Cloud (the managed SaaS service). The free tier is generous, but organizations with strict data sovereignty requirements or aversion to SaaS dependencies may find this limiting.
Provider parity gaps. Pulumi providers for AWS are generated from the Terraform AWS provider, meaning coverage is comprehensive. But some Pulumi-native features and provider behaviors differ from Terraform's, and the documentation sometimes lags the Terraform equivalent. When you encounter an undocumented behavior in a Pulumi AWS resource, you often end up reading the Terraform provider source code to understand what is happening.
Organizational adoption friction. Terraform has established itself as the industry default. Proposing Pulumi to an organization that has standardized on Terraform requires justifying the switch: retraining teams, migrating state, replacing modules, and updating CI/CD pipelines. The technical advantages are real, but the migration cost often outweighs them for established organizations.
When to Use Pulumi
Pulumi is the right choice when your team has strong software engineering practices and wants to apply them to infrastructure, when you need multi-cloud support without giving up programming language power, or when testing infrastructure code is a priority. It is particularly compelling for startups and greenfield projects where there is no existing Terraform investment to migrate, and for teams that deploy application code and infrastructure together as a unified program.
Head-to-Head Comparison
This is the table I wish I had when evaluating these tools. Every dimension is scored from production experience, not marketing materials.
| Dimension | CloudFormation | CDK | Terraform | Pulumi |
|---|---|---|---|---|
| Language support | YAML, JSON | TypeScript, Python, Java, C#, Go | HCL | TypeScript, Python, Go, C#, Java, YAML |
| State management | Service-managed | Service-managed (via CFN) | State files (S3, Terraform Cloud, local) | Pulumi Cloud, S3, local |
| Drift detection | Native (CloudFormation drift) | Via CloudFormation | terraform plan detects drift on every run | Pulumi Cloud only |
| Secret handling | Dynamic references to SSM/Secrets Manager | Dynamic references, context values | Sensitive variable marking, Vault integration | Native encryption per-value in state |
| Testing story | cfn-lint, taskcat | CDK assertions, snapshot tests | terraform validate, Terratest, OPA/Sentinel | Native unit tests with standard frameworks |
| Multi-account | StackSets | CDK Pipelines, custom | Workspaces, separate state per account | Stack per account |
| Import existing resources | aws cloudformation import | Limited support | terraform import (well-supported) | pulumi import (generates code) |
| Rollback behavior | Automatic stack rollback | Automatic (via CloudFormation) | No automatic rollback; partial state | No automatic rollback; partial state |
| Deployment speed | Slow (minutes) | Slow (minutes, via CloudFormation) | Fast (seconds to minutes) | Fast (seconds to minutes) |
| Cost | Free | Free | Free (OSS); Terraform Cloud from $0 | Free (OSS); Pulumi Cloud from $0 |
| CI/CD integration | AWS-native (CodePipeline) | CDK Pipelines | Excellent (all CI systems) | Good (all CI systems) |
| Module/construct ecosystem | Limited (nested stacks) | Construct Hub (growing) | Terraform Registry (massive) | Pulumi Registry (growing) |
| Error messages | Poor (cryptic CFN errors) | Poor (CFN errors underneath) | Good (direct API errors) | Good (direct API errors) |
| Debugging experience | Difficult (service-managed) | Moderate (can debug synth locally) | Good (local execution, verbose logging) | Good (local execution, standard debuggers) |
| Refactoring support | Painful (resource replacement risk) | Moderate (construct tree changes) | terraform state mv (manual but works) | pulumi state rename |
| Preview/plan quality | Change sets (limited detail) | CDK diff (misses drift) | Excellent (detailed attribute-level diff) | Good (detailed, similar to Terraform) |
| Resource limits | 500 per stack | 500 per stack (CFN limit) | None (practical limit is state file size) | None |
| Execution model | Service-managed (AWS runs it) | Local synth, then service-managed | Local execution (or remote via TFC) | Local execution (or remote via Pulumi Cloud) |
| Lock-in risk | High (AWS-only, proprietary) | High (AWS-only, via CloudFormation) | Medium (BSL license, but HCL is transferable to OpenTofu) | Low (Apache 2.0, standard languages) |
| Maturity | Very high (15 years) | Moderate (7 years) | Very high (12 years) | Moderate (8 years) |
State Management Architectures
State management is where IaC tools differ most fundamentally, and it is the aspect with the highest operational impact. Understanding how each tool manages state is essential for making an informed choice.
```mermaid
flowchart LR
    subgraph CFN["CloudFormation / CDK"]
        direction TB
        CFN_USER[Engineer] -->|Submit Template| CFN_SVC[CloudFormation Service]
        CFN_SVC -->|Manages| CFN_STATE[("AWS-Managed State")]
        CFN_SVC -->|API Calls| CFN_AWS[AWS Resources]
    end
    subgraph TF["Terraform"]
        direction TB
        TF_USER[Engineer] -->|terraform apply| TF_CLI[Terraform CLI]
        TF_CLI -->|Read/Write| TF_STATE[("State File: S3 + DynamoDB")]
        TF_CLI -->|API Calls| TF_AWS[Cloud Resources]
    end
    subgraph PL["Pulumi"]
        direction TB
        PL_USER[Engineer] -->|pulumi up| PL_CLI[Pulumi CLI]
        PL_CLI -->|Read/Write| PL_STATE[("State Backend: Pulumi Cloud / S3")]
        PL_CLI -->|API Calls| PL_AWS[Cloud Resources]
    end
```

CloudFormation State (Service-Managed)
CloudFormation's state is entirely managed by the AWS service. You never see, touch, or worry about a state file. When you create a stack, CloudFormation records every resource it creates and their configurations. When you update a stack, it compares your new template against its internal state to determine what needs to change.
Advantages: Zero operational overhead. No state file to back up, no locking to configure, no corruption recovery to plan for. State is highly available and durable within the AWS service.
Disadvantages: You cannot inspect, modify, or repair state directly. When a stack enters UPDATE_ROLLBACK_FAILED state, your options are limited to ContinueUpdateRollback (which may not work) or deleting the entire stack and recreating it. You also cannot easily split a stack, merge stacks, or move resources between stacks, all operations that are straightforward with Terraform state commands.
Terraform State (Self-Managed)
Terraform state is a JSON file that you are responsible for storing, securing, backing up, and managing concurrent access to. In production, this means:
- S3 backend stores the state file in a versioned, encrypted bucket
- DynamoDB table provides distributed locking to prevent concurrent applies
- IAM policies restrict who can read and write state
- State file per environment (or workspace) limits the blast radius of state corruption
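Wired together, that production setup is a few lines of backend configuration. The following is a sketch with hypothetical bucket, key, and table names; substitute your own:

```hcl
terraform {
  backend "s3" {
    bucket         = "acme-terraform-state"               # versioned, encrypted bucket
    key            = "prod/networking/terraform.tfstate"  # one state file per environment/domain
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"                    # distributed locking table
    encrypt        = true
  }
}
```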
Terraform state contains sensitive information: resource IDs, ARNs, and sometimes actual secret values (if you use terraform output for sensitive data). Treating the state file as a secret is essential.
Recovery patterns:
| Scenario | Recovery Approach |
|---|---|
| State file lost | Restore from S3 versioning, or terraform import each resource |
| State corruption | Restore previous version from S3 versioning |
| Concurrent modification | DynamoDB lock prevents this; if lock is stuck, terraform force-unlock |
| Resource exists but not in state | terraform import to bring it under management |
| Resource in state but deleted manually | terraform state rm to remove the stale entry |
| Need to move resource between states | terraform state mv with -state-out flag |
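Several of these recovery paths begin with understanding what the state actually contains. Because Terraform state is plain JSON, you can inspect a copy directly before reaching for state commands. A minimal sketch over a pared-down example state (the resource entries here are illustrative, not a complete state document):

```python
import json

# Pared-down illustration of the Terraform state structure (version 4 format).
# In practice you would read a copy fetched from your S3 backend.
state_json = """
{
  "version": 4,
  "serial": 42,
  "resources": [
    {"mode": "managed", "type": "aws_s3_bucket", "name": "logs",
     "instances": [{"attributes": {"id": "my-logs-bucket"}}]},
    {"mode": "managed", "type": "aws_iam_role", "name": "app",
     "instances": [{"attributes": {"id": "app-role"}}]}
  ]
}
"""

def resource_addresses(state: dict) -> list[str]:
    """List resource addresses recorded in state, e.g. aws_s3_bucket.logs."""
    return [f"{r['type']}.{r['name']}" for r in state.get("resources", [])]

state = json.loads(state_json)
print("serial:", state["serial"])  # increments on every state write
for addr in resource_addresses(state):
    print(addr)
```

Comparing the addresses recorded in state against what actually exists in the account tells you whether you need terraform import (resource exists, not in state) or terraform state rm (in state, deleted manually).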
Pulumi State
Pulumi state follows a similar model to Terraform's (an external state backend that the CLI reads and writes) but with different backend options:
- Pulumi Cloud (default): managed SaaS with built-in encryption, history, RBAC, and secrets management
- S3: self-managed, similar to Terraform's S3 backend
- Local file: for development only
- Azure Blob Storage, Google Cloud Storage: for respective cloud users
When using Pulumi Cloud, state management is operationally similar to CloudFormation; someone else handles the infrastructure for state storage. When using S3, you carry the same operational burden as Terraform.
Multi-Account and Multi-Region Patterns
Production AWS environments invariably span multiple accounts. How each IaC tool handles multi-account deployment is a critical architectural consideration.
| Pattern | CloudFormation | CDK | Terraform | Pulumi |
|---|---|---|---|---|
| Deploy to N accounts | StackSets (native) | CDK Pipelines, custom stages | Workspace per account or separate root modules | Stack per account |
| Cross-account references | Stack exports + imports | SSM Parameter Store lookups | Remote state data sources | Stack references (type-safe) |
| Centralized pipeline | StackSets from management account | CDK Pipelines with cross-account roles | Terraform Cloud workspaces with team-based access | Pulumi Cloud with stack RBAC |
| Drift detection at scale | Native (per-stack) | Via CloudFormation | terraform plan per workspace | Pulumi Cloud drift detection |
| Organizational guardrails | SCPs + StackSets | SCPs + CDK Aspects | Sentinel/OPA policies | Pulumi CrossGuard |
```mermaid
flowchart TD
    CICD[CI/CD Pipeline] --> MGMT_DEPLOY{Deploy to Management Account}
    CICD --> SHARED_DEPLOY{Deploy to Shared Services Account}
    CICD --> WORKLOAD_DEPLOY{Deploy to Workload Accounts}
    subgraph MGMT["Management Account"]
        MGMT_DEPLOY --> ORG[AWS Organizations SCPs]
        MGMT_DEPLOY --> GUARD[Guardrail Templates]
    end
    subgraph SHARED["Shared Services Account"]
        SHARED_DEPLOY --> VPC_SHARED[Transit Gateway Central Networking]
        SHARED_DEPLOY --> DNS[Route 53 Central DNS]
        SHARED_DEPLOY --> LOG[CloudWatch Central Logging]
    end
    subgraph WORKLOAD["Workload Accounts (N)"]
        WORKLOAD_DEPLOY --> VPC_WL[VPC App Networking]
        WORKLOAD_DEPLOY --> COMPUTE[ECS / EKS / Lambda Compute]
        WORKLOAD_DEPLOY --> DATA[RDS / DynamoDB Data Stores]
    end
    VPC_SHARED -.->|Transit Gateway Attachment| VPC_WL
    LOG -.->|Cross-Account Log Delivery| COMPUTE
```
CloudFormation StackSets are the most operationally simple approach for multi-account deployment. From a single management account (or delegated administrator account), you define a stack set and specify the target accounts and regions. CloudFormation handles deploying, updating, and monitoring the stacks across all targets. StackSets support automatic deployment to new accounts added to an organizational unit, so new accounts inherit your guardrails without manual intervention.
Terraform multi-account typically uses one of two patterns: separate root modules per account with separate state files, or workspaces within a single configuration. The separate root modules approach provides stronger isolation (a mistake in one account's Terraform cannot affect another), while the workspace approach reduces code duplication. Most organizations I have worked with prefer separate root modules for production workloads and workspaces for non-production environments.
CDK Pipelines provide a structured multi-account deployment pipeline that deploys CDK stacks across environments (dev, staging, production) in sequence with manual approval gates between stages. The pipeline itself runs in a designated pipeline account and assumes cross-account roles to deploy to target accounts.
Pulumi stacks are the natural unit of multi-account deployment. Each AWS account gets its own stack with its own state and configuration. Pulumi's type-safe stack references make it straightforward to pass outputs (VPC IDs, subnet IDs, security group IDs) from a shared networking stack to application stacks in different accounts.
Testing Strategies
Testing infrastructure code is fundamentally different from testing application code. You cannot spin up an AWS account in a unit test the way you spin up an in-memory database. Each tool approaches this challenge differently.
| Testing Type | CloudFormation | CDK | Terraform | Pulumi |
|---|---|---|---|---|
| Static analysis / linting | cfn-lint, cfn-nag | cdk-nag, ESLint/Pylint | terraform validate, tflint, checkov | Standard language linters, checkov |
| Unit testing | Not applicable (YAML) | CDK assertions library | Not directly (HCL); Terratest for Go-based tests | Native unit tests (Jest, pytest, Go test) |
| Snapshot testing | Not applicable | CDK snapshot assertions | Not applicable | Not directly supported |
| Policy testing | AWS Config rules | CDK Aspects, cdk-nag | Sentinel (paid), OPA/Conftest (free) | CrossGuard (Pulumi Cloud), OPA |
| Integration testing | taskcat (deploy and validate) | integ-runner (experimental) | Terratest (deploy, validate, destroy) | Automation API (programmatic deployments) |
| Preview/dry run | Change sets | cdk diff | terraform plan | pulumi preview |
CDK's testing story is the most interesting for unit testing specifically. CDK's assertions library lets you write tests like:
```typescript
import { Template } from 'aws-cdk-lib/assertions';

// stack is the CDK Stack under test
const template = Template.fromStack(stack);

template.hasResourceProperties('AWS::S3::Bucket', {
  BucketEncryption: {
    ServerSideEncryptionConfiguration: [{
      ServerSideEncryptionByDefault: {
        SSEAlgorithm: 'aws:kms',
      },
    }],
  },
});
```
This test validates that the synthesized CloudFormation template contains an S3 bucket with KMS encryption, without deploying anything. Snapshot tests go further, capturing the entire synthesized template and alerting you when it changes unexpectedly.
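The snapshot mechanism itself is simple and tool-agnostic: serialize the synthesized template deterministically, store it, and fail when a later run produces different output. A minimal sketch of that idea, using an in-memory snapshot and an illustrative template fragment rather than CDK's real implementation:

```python
import json

def snapshot_matches(template: dict, stored_snapshot: str) -> bool:
    """Compare a synthesized template against a previously stored snapshot.

    Canonical JSON (sorted keys) makes the comparison order-insensitive.
    """
    return json.dumps(template, sort_keys=True, indent=2) == stored_snapshot

# First run: capture the snapshot (illustrative template fragment)
template_v1 = {"Resources": {"Bucket": {"Type": "AWS::S3::Bucket"}}}
stored = json.dumps(template_v1, sort_keys=True, indent=2)

# Later run: an unexpected change fails the snapshot comparison
template_v2 = {"Resources": {"Bucket": {"Type": "AWS::S3::Bucket",
                                        "DeletionPolicy": "Retain"}}}
print(snapshot_matches(template_v1, stored))  # unchanged template passes
print(snapshot_matches(template_v2, stored))  # changed template fails
```

The failure does not tell you the change is wrong, only that it happened; the human reviewing the diff decides whether to accept the new snapshot.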
Terraform's testing story relies heavily on Terratest, a Go library that programmatically runs terraform apply, validates the resulting infrastructure by making API calls, and then runs terraform destroy to clean up. It is thorough but slow (each test deploys real infrastructure) and requires Go knowledge.
Pulumi's testing story is the most natural for software engineers. Because infrastructure is defined in a real programming language, you can mock the Pulumi engine and write fast unit tests that validate your infrastructure logic without making any cloud API calls. Combined with integration tests using Pulumi's Automation API, this provides a comprehensive testing pyramid for infrastructure code.
CI/CD Integration
How each tool fits into automated deployment pipelines is a practical concern that influences the day-to-day developer experience.
| Aspect | CloudFormation | CDK | Terraform | Pulumi |
|---|---|---|---|---|
| Native CI/CD | CodePipeline + CloudFormation deploy action | CDK Pipelines (CodePipeline-based) | Terraform Cloud, GitHub Actions, all major CI | Pulumi Deployments, GitHub Actions, all major CI |
| GitOps pattern | Limited (no native Git integration) | Via CDK Pipelines | Well-supported (Atlantis, Spacelift, env0) | Pulumi Deployments (Git push triggers) |
| Plan in PR | Change set as PR comment (custom) | CDK diff as PR comment (custom) | Native with Atlantis, Spacelift, TFC | Native with Pulumi Deployments |
| Approval gates | CodePipeline manual approval | CDK Pipelines manual approval | Terraform Cloud run approval, Sentinel policies | Pulumi Deployments approval, CrossGuard |
| Concurrent deployments | Stack-level locking (automatic) | Stack-level locking (via CFN) | State locking (DynamoDB/TFC) | State locking (Pulumi Cloud/backend) |
The most mature CI/CD ecosystem belongs to Terraform. Tools like Atlantis (open source) and Spacelift (SaaS) provide a seamless pull-request-driven workflow: open a PR that modifies Terraform code, the CI tool runs terraform plan and posts the output as a PR comment, reviewers evaluate the plan alongside the code changes, and merging the PR triggers terraform apply. This workflow (infrastructure changes reviewed as code, with the plan output providing concrete visibility into what will change) is the gold standard for operational safety.
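That plan-in-PR loop can be extended with a mechanical gate. A minimal sketch of a CI check that parses the summary line of captured terraform plan output and flags destroys for extra scrutiny (the plan text below is illustrative; a real pipeline would capture the command's stdout):

```python
import re

def parse_plan_summary(plan_output: str) -> dict:
    """Extract add/change/destroy counts from terraform plan output."""
    m = re.search(r"Plan: (\d+) to add, (\d+) to change, (\d+) to destroy",
                  plan_output)
    if m is None:
        # "No changes." plans emit no summary line
        return {"add": 0, "change": 0, "destroy": 0}
    add, change, destroy = map(int, m.groups())
    return {"add": add, "change": change, "destroy": destroy}

# Illustrative plan output as it would appear in CI logs
plan_text = "Plan: 3 to add, 1 to change, 1 to destroy."
summary = parse_plan_summary(plan_text)
if summary["destroy"] > 0:
    print(f"WARNING: plan destroys {summary['destroy']} resource(s); "
          "require manual approval before apply")
```

A gate like this is a backstop, not a substitute for reading the plan; it simply guarantees that a buried destroy cannot slip through unannounced.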
CDK Pipelines provides a similar experience within the AWS ecosystem, but it is tightly coupled to CodePipeline, which limits flexibility. Pulumi Deployments is newer but follows the same pattern of Git-triggered deployments with plan-in-PR feedback.
Cost and Licensing
| Tool | Engine Cost | Managed Service | License | Commercial Options |
|---|---|---|---|---|
| CloudFormation | Free | N/A (it is the managed service) | Proprietary | N/A |
| CDK | Free (synthesizes to CFN) | N/A (deploys via CloudFormation) | Apache 2.0 | N/A |
| Terraform CLI | Free | Terraform Cloud: Free tier (500 resources), Team ($20/user/month), Business (custom) | BSL 1.1 | Terraform Enterprise (self-hosted) |
| OpenTofu | Free | None (community-managed) | MPL 2.0 | Third-party services (Spacelift, env0) |
| Pulumi CLI | Free | Pulumi Cloud: Individual (free), Team ($50/user/month), Enterprise (custom) | Apache 2.0 | Pulumi Business Critical (self-hosted) |
The cost comparison is nuanced. CloudFormation and CDK are free because AWS makes money on the resources you provision, not the tool you use to provision them. Terraform CLI is free for end users, but teams that want managed state, RBAC, and policy enforcement need Terraform Cloud or Enterprise, which adds per-user cost. Pulumi follows a similar model with Pulumi Cloud.
The BSL license question matters primarily for companies building products that compete with HashiCorp. If you are using Terraform to manage your own infrastructure, the BSL license has no practical impact. If you are building a managed Terraform service, you need to evaluate the license terms carefully, or consider OpenTofu, which maintains MPL 2.0 licensing and tracks the Terraform 1.5.x feature set with ongoing community additions.
Migration Paths
Migrating between IaC tools is a reality that many organizations face. Here is the current state of migration tooling:
| Migration Path | Tool | Maturity | Notes |
|---|---|---|---|
| CloudFormation → CDK | cdk migrate | Moderate | Generates L1 constructs (verbose); manual refactoring to L2/L3 needed |
| CloudFormation → Terraform | cf2tf, former2 | Moderate | Generates HCL from existing stacks; requires manual refinement |
| Terraform → Pulumi | pulumi convert --from terraform | Good | Converts HCL to Pulumi code; handles most resources well |
| Terraform → CDK | cdktf (CDK for Terraform) | Moderate | An alternative rather than a migration tool: write CDK constructs targeting Terraform |
| Any → Any (resource import) | terraform import, pulumi import, CloudFormation import | Varies | Import existing resources into management without recreation |
| Terraform → OpenTofu | Drop-in replacement | High | Binary-compatible for Terraform ≤ 1.5.x; diverging for newer features |
The most common migration I see in practice is CloudFormation → Terraform. Organizations start with CloudFormation because it is there, accumulate hundreds of stacks, and eventually hit the pain points (verbosity, speed, lack of multi-cloud support) that make Terraform attractive. The migration is typically gradual; new infrastructure goes into Terraform while existing CloudFormation stacks are migrated opportunistically. Trying to migrate everything at once is almost always a mistake.
The easiest migration is Terraform → OpenTofu. Since OpenTofu is a fork of Terraform 1.5.x, existing Terraform configurations, state files, and providers work with minimal or no changes. The migration is literally replacing the terraform binary with tofu in most cases.
Common Failure Modes
Every IaC tool has characteristic failure modes. Understanding them is essential for building reliable infrastructure practices.
CloudFormation: Stack Update Rollback Loops
When a CloudFormation stack update fails, the service automatically rolls back to the previous state. But if the rollback itself fails (because a resource was manually deleted, a dependency changed outside CloudFormation, or an API rate limit prevents the rollback operations), the stack enters UPDATE_ROLLBACK_FAILED state. At this point, you cannot update, roll back, or (easily) delete the stack.
Recovery: Use ContinueUpdateRollback with the --resources-to-skip parameter to skip the specific resources causing the rollback failure. If that does not work, you may need to recreate the missing resources manually so the rollback can complete, or as a last resort, delete the entire stack (which deletes all resources in it).
Prevention: Avoid manual modifications to CloudFormation-managed resources. Use change sets to preview updates. Keep stacks small enough that failures affect a limited set of resources.
CDK: Construct Version Conflicts
As CDK construct libraries evolve, teams consuming shared constructs can encounter version conflicts where two constructs require different versions of the same CDK core library. This manifests as TypeScript compilation errors, unexpected synthesis behavior, or runtime errors during deployment.
Recovery: Align all construct library versions. Use npm ls or equivalent to identify the dependency tree and find conflicts. Pin construct library versions in your package.json.
Prevention: Establish a construct library versioning strategy early. Use a monorepo or consistent versioning across shared construct libraries. Run cdk doctor regularly to identify potential issues.
Terraform: State File Corruption
State corruption typically occurs when an apply is interrupted mid-operation: a network failure, a CI job timeout, an operator pressing Ctrl-C. Terraform may have created some resources and updated the state for them, but not yet completed the full operation. The state file is now partially updated; some resources exist but others do not, and the state may not accurately reflect reality.
Recovery: If using S3 versioning (which you should be), restore the previous state version and run terraform plan to see what needs to be reconciled. Use terraform state commands to manually add, remove, or modify resource entries. In severe cases, delete the state and reimport all resources.
Prevention: Always use remote state with locking. Never interrupt a running apply unless absolutely necessary. Use -target for incremental changes to large configurations. Run terraform plan after any failed apply to assess the state.
Terraform: Provider Version Drift
When a Terraform provider is updated, it may change the behavior of existing resources: new required attributes, changed defaults, deprecated arguments. If your lock file (.terraform.lock.hcl) is not committed or provider versions are not pinned, different team members or CI environments may use different provider versions, leading to inconsistent plans and applies.
Recovery: Pin the provider version, run terraform init -upgrade to get the specific version, and then terraform plan to see if the new version changes any resource behavior.
Prevention: Always commit .terraform.lock.hcl. Pin provider versions in your required_providers block. Update providers deliberately (not as a side effect of terraform init).
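A pinned configuration looks like the following; the version constraints are illustrative, so substitute the versions your team has actually tested:

```hcl
terraform {
  required_version = ">= 1.5.0"
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"   # any 5.x release; blocks a surprise 6.x upgrade
    }
  }
}
```

Combined with a committed .terraform.lock.hcl, this ensures every engineer and every CI run resolves the same provider build.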
Pulumi: Stack Reference Cycles
When Pulumi stacks reference each other's outputs, circular dependencies can emerge. Stack A references Stack B's output, and Stack B references Stack A's output. This is not caught at compile time; it manifests as deployment failures when the referenced output does not yet exist.
Recovery: Break the cycle by introducing a third stack that both A and B reference, or by using a shared configuration store (SSM Parameter Store) as an intermediary.
Prevention: Design your stack dependency graph as a DAG (directed acyclic graph) from the start. Document stack dependencies. Use Pulumi's stack reference types to make dependencies explicit.
Key Architectural Recommendations
After years of operating all four tools across teams ranging from 3 engineers to 300, here are the patterns I recommend:
- Default to Terraform for new organizations. Its community, module ecosystem, and CI/CD tooling are unmatched. The operational overhead of state management is well-understood and manageable. Multi-cloud support future-proofs your investment even if you are AWS-only today.
- Choose CDK when your team is AWS-only and values developer experience. CDK's high-level constructs genuinely accelerate AWS-specific development, and the testing story is strong. Accept that you are inheriting CloudFormation's deployment characteristics and plan accordingly.
- Consider Pulumi for greenfield projects with strong engineering teams. If your team is already writing TypeScript or Python daily and wants to apply software engineering practices to infrastructure, Pulumi's testing and programming model is compelling. The smaller community is a real drawback, but it is improving.
- Never store Terraform or Pulumi state locally in production. Use S3 + DynamoDB for Terraform, or Pulumi Cloud / S3 for Pulumi. Enable versioning. Restrict access. Treat state as a sensitive artifact.
- Keep blast radius small. Whether it is CloudFormation stacks, Terraform root modules, or Pulumi stacks, size them so that a failure in one does not cascade. A single Terraform state file managing your entire AWS account is a ticking time bomb. Split by domain: networking, compute, data, monitoring.
- Invest in plan review as a deployment gate. The plan/preview output is the most valuable artifact your IaC tool produces. Require plan review in CI before any apply to production. Automate plan posting to pull requests. Train your team to read plans carefully; the "1 to destroy" buried in a large plan is often the line that matters most.
- Use policy-as-code from the start. Whether it is Sentinel, OPA, CDK Aspects, or Pulumi CrossGuard, automated policy enforcement catches misconfigurations that human reviewers miss. Start with basic rules (no public S3 buckets, encryption required, no overly permissive security groups) and expand.
- Standardize and share modules/constructs. The biggest productivity gain in IaC comes from reusable, opinionated modules that encode your organization's best practices. A well-built VPC module that every team uses means every VPC in your organization has consistent CIDR allocation, subnet sizing, flow log configuration, and security group rules.
- Do not mix IaC tools for the same resource. If a resource is managed by Terraform, do not also manage it with CloudFormation or manually through the console. Mixed management is the most common source of drift, conflicts, and operational surprises. Pick one tool per resource and be disciplined about it.
- Plan for migration, even if you do not migrate. Whatever tool you choose today, you may need to migrate from it in 3-5 years. Keep your infrastructure definitions modular, well-documented, and avoid deep coupling to tool-specific features that would make migration prohibitively expensive.
- Test your disaster recovery procedures. Can you recover from a corrupted state file? Can you recreate a stack from scratch? Can you import existing resources into a new configuration? Test these scenarios before you need them. The middle of an incident is the wrong time to learn that your state backup strategy has a gap.
- Treat IaC as software, not as configuration. Apply the same practices you use for application code: version control, code review, testing, CI/CD, documentation. The teams that treat their Terraform code as a second-class citizen are the same teams that deploy breaking changes to production on Friday afternoons.
Additional Resources
- AWS CloudFormation Documentation
- AWS CDK Documentation
- CDK Construct Hub
- Terraform Documentation
- Terraform Registry
- OpenTofu Documentation
- Pulumi Documentation
- Pulumi Registry
- Terratest Documentation
- Spacelift: Terraform CI/CD
- Atlantis: Terraform Pull Request Automation
Let's Build Something!
I help teams ship cloud infrastructure that actually works at scale. Whether you're modernizing a legacy platform, designing a multi-region architecture from scratch, or figuring out how AI fits into your engineering workflow, I've seen your problem before. Let me help.
Currently taking on select consulting engagements through Vantalect.

