About the author: I'm Charles Sieg, a cloud architect and platform engineer who builds apps, services, and infrastructure for Fortune 1000 clients through Vantalect. If your organization is rethinking its software strategy in the age of AI-assisted engineering, let's talk.
Every e-commerce team starts with a synchronous checkout. The API receives a cart, charges the card, decrements inventory, and returns a confirmation. It works until it doesn't. Payment processors time out. Warehouses operate on batch cycles. Inventory reservations race against each other across regions. I have rebuilt checkout and fulfillment pipelines three times across different organizations, and every rebuild ended at the same place: an asynchronous state machine with compensating transactions. AWS Step Functions is the right tool for this job, and this article covers the specific patterns, cost math, and operational lessons from running cart-to-delivery workflows in production.
If you want the general Step Functions architecture reference, see AWS Step Functions: An Architecture Deep-Dive. This article focuses on a single domain: the async cart and product fulfillment workflow, from the moment a customer clicks "Place Order" to the moment the shipment confirmation lands in their inbox.
Why Cart and Fulfillment Demands a State Machine
A checkout workflow looks simple on a whiteboard. Five or six boxes connected by arrows. In production, those boxes hide serious complexity.
The Sequential Dependency Problem
Cart operations involve steps that depend on each other and steps that must undo themselves when downstream operations fail. Reserve inventory before charging the card. Charge the card before committing the order. Generate a shipping label before notifying the warehouse. If shipping label generation fails, you need to refund the card and release the inventory reservation. This chain of forward operations and compensating rollbacks is the saga pattern, and it maps directly to a Step Functions state machine.
Without a state machine, teams scatter this logic across Lambda functions, SQS consumers, and application code. Each component knows about the next component in the chain. Each one handles its own retries and manages its own state. When the payment service fails halfway through, nobody knows whether inventory was already reserved. Was the customer charged? Did the warehouse receive a pick request? Debugging requires correlating logs across six services, and the answer is usually "it depends on the timing."
What the State Machine Gives You
Step Functions centralizes the orchestration. Each service does its job and returns a result. The state machine decides what happens next, what to retry, and what to undo. Three properties make this valuable for commerce workflows:
| Property | Without Step Functions | With Step Functions |
|---|---|---|
| State visibility | Correlate logs across services | Single execution view shows every step |
| Compensation logic | Scattered across consumers | Centralized in Catch blocks |
| Retry policy | Hardcoded per service | Declarative per state with exponential backoff |
| Execution guarantee | At-least-once with manual dedup | Exactly-once for Standard workflows |
| Timeout handling | Custom per service | HeartbeatSeconds and TimeoutSeconds per state |
Anatomy of the Cart-to-Delivery Workflow
The workflow spans three phases: cart finalization, payment and reservation, and physical fulfillment. Each phase has different latency characteristics, failure modes, and cost profiles.
Phase 1: Cart Validation and Lock
When the customer clicks "Place Order," the workflow starts. The first states validate the cart contents against current catalog data, check inventory availability, and place a soft lock on the reserved items. This phase runs in milliseconds. Every step is a direct DynamoDB read or conditional write.
Phase 2: Payment and Order Commitment
Payment authorization is the first truly async operation. The workflow sends a payment request to the processor and waits for a callback. This is where the .waitForTaskToken pattern earns its place: Step Functions pauses execution, the payment processor does its work (fraud checks, bank authorization, 3DS challenges), and when finished, it calls back with the task token to resume the workflow. No polling. No wasted compute. The execution can pause for minutes or hours at zero cost.
After payment confirmation, the workflow commits the order: finalizes inventory deductions, writes the order record, and publishes an OrderPlaced event for downstream systems.
Phase 3: Fulfillment Pipeline
Physical fulfillment operates on a different timescale. Warehouse management systems process pick lists in batches. Shipping carriers generate labels asynchronously. Tracking numbers propagate through carrier APIs with variable delay. This phase uses callback patterns heavily, with each integration point pausing the workflow until an external system signals completion.
```mermaid
flowchart TD
    A[Place Order Received] --> B[Validate Cart Items]
    B --> C{Items Available?}
    C -->|No| D[Notify Customer Out of Stock]
    C -->|Yes| E[Reserve Inventory]
    E --> F[Authorize Payment waitForTaskToken]
    F --> G{Payment Approved?}
    G -->|No| H[Release Inventory]
    G -->|Yes| I[Commit Order to DynamoDB]
    I --> J[Generate Shipping Label]
    J --> K[Queue Warehouse Pick Request]
    K --> L[Wait for Shipment waitForTaskToken]
    L --> M[Send Tracking to Customer]
    M --> N[Order Complete]
```
The Saga Pattern: Forward Steps and Compensating Transactions
The saga pattern is the backbone of any reliable commerce workflow. Each forward step in the workflow has a corresponding compensation step that undoes its work. When any step fails, the workflow executes compensations in reverse order to restore consistency.
Mapping Forward Steps to Compensations
Every forward operation that mutates state needs a compensation. Read-only steps (cart validation, inventory checks) do not. Here is the full mapping for a cart-to-delivery workflow:
| Forward Step | Mutation | Compensation Step | Compensation Action |
|---|---|---|---|
| Reserve Inventory | DynamoDB conditional decrement | Release Inventory | DynamoDB conditional increment |
| Authorize Payment | Payment processor hold | Void Authorization | Payment processor void |
| Commit Order | DynamoDB order record created | Cancel Order | Update order status to "cancelled" |
| Charge Payment | Payment processor capture | Refund Payment | Payment processor refund |
| Generate Shipping Label | Carrier API label created | Void Label | Carrier API void (if supported) |
| Queue Pick Request | SQS message to warehouse | Cancel Pick | SQS cancellation message |
Implementing Compensation in ASL
The saga compensation maps to Step Functions Catch blocks. Each task state specifies a Catch that routes to the appropriate compensation chain. The compensation chain runs in reverse order: if the "Charge Payment" step fails, the workflow runs "Cancel Order," then "Void Authorization," then "Release Inventory."
The critical detail is ResultPath. Set it to something like $.error in your Catch block so the error information gets appended to the state input rather than replacing it. Without this, the compensation steps lose access to the order data they need to perform rollbacks.
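The reverse-order unwind is the part teams most often get wrong when it lives in scattered consumers. As a plain-Python illustration of the control flow the Catch chain encodes (the step names and functions here are hypothetical, and a dict stands in for the DynamoDB saga log table):

```python
# Illustration of the saga unwind Step Functions performs via Catch
# blocks: record each completed forward step, and on failure compensate
# the completed steps in reverse order.

saga_log = {}  # stands in for the DynamoDB saga log table

def run_saga(order_id, steps):
    """steps: ordered list of (step_name, forward_fn, compensate_fn)."""
    completed = []
    compensations = {name: comp for name, _, comp in steps}
    for name, forward, _ in steps:
        try:
            # Key each completed step like the (orderId, stepName)
            # composite key on the saga log table.
            saga_log[(order_id, name)] = forward(order_id)
            completed.append(name)
        except Exception as err:
            # Unwind every completed step, most recent first.
            for done in reversed(completed):
                compensations[done](order_id)
            return {"status": "failed", "failedStep": name, "error": str(err)}
    return {"status": "completed", "steps": completed}
```

If "Charge Payment" raises after "Reserve Inventory" and "Authorize Payment" succeeded, the unwind runs "Void Authorization" and then "Release Inventory", exactly the order the diagram below shows.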
```mermaid
flowchart TD
    A[Reserve Inventory] --> B[Authorize Payment]
    B --> C[Commit Order]
    C --> D[Charge Payment]
    D -->|Failure| E[Cancel Order]
    E --> F[Void Authorization]
    F --> G[Release Inventory]
    G --> H[Notify Customer of Failure]
    D -->|Success| I[Continue to Fulfillment]
```
DynamoDB as the Saga Log
Every saga needs a durable log that records which steps completed. DynamoDB is the natural choice on AWS. I use a single orders table with a composite key (orderId as partition key, stepName as sort key) where each completed step writes a record with its output data. Compensation steps query this table to know what to undo and to confirm they have the data needed for rollback.
This table also solves observability. When something goes wrong, I query the orders table for a given orderId and see exactly which steps completed, which failed, and what data each step produced. Combined with the Step Functions execution history, debugging becomes straightforward.
Standard vs. Express: Choosing Workflow Types for Each Phase
The cart-to-delivery workflow spans timeframes from milliseconds to days. Using a single Standard workflow for the entire pipeline wastes money on the fast phases and works perfectly for the slow phases. The right approach is a hybrid.
When Standard Workflows Earn Their Cost
Standard workflows cost $0.025 per 1,000 state transitions. They provide exactly-once execution semantics, up to one year of execution duration, and full execution history retrieval through the API. For the payment-and-commitment phase of a cart workflow, Standard is the only viable option because:
- Payment callbacks can take minutes (3DS challenges, fraud review queues)
- The saga pattern requires exactly-once guarantees to avoid double-charging
- Execution history must be available for dispute resolution and audit trails
- StartExecution is idempotent for Standard workflows when called with the same name and input, giving you natural deduplication
When Express Workflows Save Money
Express workflows cost $1.00 per million requests plus duration-based charges. They run for a maximum of 5 minutes and provide at-least-once execution semantics. For the cart validation phase (inventory checks, price verification, coupon validation), Express workflows handle the throughput at a fraction of the cost.
| Characteristic | Standard Workflow | Express Workflow |
|---|---|---|
| Max duration | 1 year | 5 minutes |
| Execution semantics | Exactly-once | At-least-once |
| Pricing model | $0.025 per 1,000 transitions | $1.00 per 1M requests + duration |
| State transition rate | 800/sec (soft) | 100,000/sec |
| Execution history | Full API retrieval, 90-day retention | CloudWatch Logs only |
| Idempotent start | Yes (same name + input) | No |
| Use in cart workflow | Payment, saga, fulfillment | Cart validation, notifications |
The Hybrid Architecture
I structure the cart workflow as a Standard parent workflow that invokes Express child workflows for high-throughput, short-lived operations. The parent manages the saga lifecycle. The children handle validation, notification fanout, and other idempotent bursts.
A 10,000-order-per-day storefront with an average of 12 state transitions per order in the Standard workflow and 8 transitions in Express child workflows costs roughly:
| Component | Calculation | Monthly Cost |
|---|---|---|
| Standard parent | 300K orders x 12 transitions = 3.6M transitions | $90.00 |
| Express children | 300K orders x 1 invocation = 300K requests + ~150ms avg at 64MB | ~$2.50 |
| Lambda invocations | ~15 per order x 300K = 4.5M invocations | ~$2.70 |
| DynamoDB (saga log) | ~10 writes + 5 reads per order | ~$4.50 |
| Total orchestration | | ~$99.70 |
That is $100/month to orchestrate 300,000 orders with full saga guarantees, execution history, and callback-based async integration. The alternative (building this in application code with SQS and custom state management) costs more in engineering time than Step Functions will ever cost in AWS bills.
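The Standard line item in the table is easy to sanity-check against the volumes assumed above:

```python
# Verify the Standard-workflow cost line: 10K orders/day, 12 state
# transitions per order, at the published $0.025 per 1,000 transitions.
orders_per_month = 10_000 * 30
transitions_per_order = 12
price_per_1k_transitions = 0.025  # USD

transitions = orders_per_month * transitions_per_order  # 3.6M
standard_cost = transitions / 1_000 * price_per_1k_transitions
print(f"${standard_cost:.2f}")  # → $90.00
```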
The Callback Pattern for Async Integration
The .waitForTaskToken callback pattern is the single most important integration pattern for fulfillment workflows. External systems (payment processors, warehouse management, shipping carriers) operate asynchronously. The callback pattern lets Step Functions pause without consuming resources and resume when the external system finishes.
How waitForTaskToken Works
When a task state uses the .waitForTaskToken suffix in its Resource ARN, Step Functions generates a unique task token and includes it in the task input. The task can pass this token to an external system (via SQS, SNS, Lambda, or API Gateway). The execution pauses. When the external system finishes, it calls SendTaskSuccess or SendTaskFailure with the token to resume execution.
The execution can wait for up to one year at zero additional cost. No polling loops. No idle compute. The state transition charge is paid once when the state enters and once when it exits.
Payment Authorization Example
For payment authorization, the workflow sends a message to an SQS queue that the payment service consumes. The message includes the task token, order details, and payment method. The payment service processes the authorization (which may involve external 3DS redirects, fraud scoring, and bank communication) and calls SendTaskSuccess with the authorization code or SendTaskFailure with the decline reason.
| Integration Point | Transport | Token Storage | Typical Wait Time |
|---|---|---|---|
| Payment authorization | SQS queue | DynamoDB mapping table | 2 seconds to 30 minutes |
| Fraud review (manual) | SQS + human workflow | DynamoDB mapping table | 1 hour to 48 hours |
| Warehouse pick/pack | SQS to WMS | DynamoDB mapping table | 15 minutes to 4 hours |
| Shipping label generation | Direct SDK integration | In-memory (fast) | 1 to 5 seconds |
| Carrier pickup scan | EventBridge from webhook | DynamoDB mapping table | 2 hours to 24 hours |
Storing Task Tokens
The external system needs to map its internal job ID back to the Step Functions task token. I store this mapping in a DynamoDB table with the external job ID as the partition key and the task token as an attribute. When the external system completes, it looks up the task token by its job ID and calls SendTaskSuccess. Simple, durable, and fast.
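The round trip is small enough to sketch. Here a dict stands in for the DynamoDB mapping table, and the notifier is injected as a stand-in for the real SendTaskSuccess call (the function and table names are illustrative, not a prescribed API):

```python
# Sketch of the external-job-ID → task-token mapping. A dict stands in
# for the DynamoDB table; notify_step_functions stands in for the
# Step Functions SendTaskSuccess API call.

token_by_job_id = {}

def register_job(job_id, task_token):
    # Written when the workflow dispatches work to the external system,
    # alongside the SQS message carrying the token.
    token_by_job_id[job_id] = task_token

def complete_job(job_id, result_json, notify_step_functions):
    # Called by the external system when its work finishes: look up the
    # token by the system's own job ID and resume the execution.
    token = token_by_job_id.pop(job_id)
    notify_step_functions(task_token=token, output=result_json)
```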
Set HeartbeatSeconds on every callback state. If the external system silently fails (crashes, loses the message, drops the request), the heartbeat timeout catches it. Without this, the execution hangs indefinitely. I use heartbeat values of 2x the expected processing time: if payment authorization normally takes 10 seconds, set the heartbeat to 20 seconds. If the warehouse typically picks within 2 hours, set it to 4 hours.
Error Handling and Retry Strategy
Commerce workflows have zero tolerance for silent failures. A lost order is a lost customer. The retry and catch mechanics in Step Functions handle both transient glitches and permanent failures, but getting the configuration right requires understanding the failure taxonomy.
Transient vs. Permanent Failures
| Failure Type | Examples | Strategy | Step Functions Mechanism |
|---|---|---|---|
| Transient | Lambda throttle, DynamoDB provisioned throughput exceeded, network timeout | Retry with backoff | Retry with IntervalSeconds, BackoffRate, MaxAttempts |
| Permanent | Payment declined, item discontinued, address undeliverable | Compensate and notify | Catch block routing to compensation chain |
| Ambiguous | Payment processor timeout (charged or not?) | Query then decide | Lambda that checks payment status before compensating |
| Infrastructure | Service outage, region degradation | Wait and retry longer | Retry with high MaxAttempts and long MaxDelaySeconds |
The ambiguous category is the dangerous one. A payment processor timeout does not tell you whether the charge went through. Blindly retrying may double-charge the customer. Blindly compensating may void a successful charge. The correct pattern is a verification Lambda that queries the payment processor for the transaction status before deciding whether to retry or compensate.
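That decision logic is worth making explicit. A minimal sketch of the verification Lambda's core, where the processor lookup is a stand-in (real code would query the processor by the idempotency key it was given):

```python
def resolve_ambiguous_timeout(order_id, lookup_charge):
    """Decide how to proceed after a payment-processor timeout.

    lookup_charge(order_id) returns the processor's view of the charge:
    "succeeded", "failed", or None if no charge record exists.
    """
    status = lookup_charge(order_id)
    if status == "succeeded":
        return "continue"    # charge landed: resume, never re-charge
    if status == "failed":
        return "compensate"  # definitive decline: run the saga unwind
    return "retry"           # no record: the request never arrived, safe to retry
```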
Retry Configuration for Commerce
My standard retry configuration for commerce workflows:
```json
{
  "Retry": [
    {
      "ErrorEquals": ["Lambda.ServiceException", "Lambda.TooManyRequestsException"],
      "IntervalSeconds": 2,
      "MaxAttempts": 6,
      "BackoffRate": 2.0,
      "MaxDelaySeconds": 60
    },
    {
      "ErrorEquals": ["States.TaskFailed"],
      "IntervalSeconds": 5,
      "MaxAttempts": 3,
      "BackoffRate": 2.0
    }
  ],
  "Catch": [
    {
      "ErrorEquals": ["States.ALL"],
      "ResultPath": "$.error",
      "Next": "CompensationChain"
    }
  ]
}
```
The first retrier handles AWS infrastructure transience (Lambda cold starts failing, throttles, transient service errors) with aggressive retry. The second retrier handles application-level task failures with fewer attempts. The Catch block captures anything that survives both retriers and routes to compensation.
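Concretely, the first retrier waits 2, 4, 8, 16, and 32 seconds, then hits the MaxDelaySeconds cap. The schedule math is simple enough to verify:

```python
def retry_delays(interval, backoff_rate, max_attempts, max_delay=None):
    # The delay before retry attempt i is interval * backoff_rate**i,
    # capped at max_delay (MaxDelaySeconds) when one is configured.
    delays = []
    for attempt in range(max_attempts):
        delay = interval * backoff_rate ** attempt
        if max_delay is not None:
            delay = min(delay, max_delay)
        delays.append(delay)
    return delays

print(retry_delays(2, 2.0, 6, max_delay=60.0))
# → [2.0, 4.0, 8.0, 16.0, 32.0, 60.0]
```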
Dead Letter Queues for Callback Failures
When an external system never calls back and the heartbeat expires, the execution fails. Route these failures to an SQS dead letter queue through the compensation chain. A separate process monitors the DLQ and triggers investigation workflows (check the external system, verify the order state, alert the operations team). I have seen warehouse integrations silently drop callback messages when their SQS consumer crashes during deployment. The heartbeat timeout caught it within 30 minutes instead of discovering it the next morning through customer complaints.
Idempotency: The Non-Negotiable Requirement
Every Lambda function in a fulfillment workflow must be idempotent. Step Functions Standard workflows provide exactly-once state transitions, but the services they invoke may receive duplicate calls during retries. If "Charge Payment" retries after a timeout, the Lambda must check whether the charge already completed before attempting it again.
Idempotency Keys
Use the order ID combined with the step name as an idempotency key. Before performing any mutation, the Lambda checks DynamoDB for a record matching that key. If found, it returns the stored result. If not found, it performs the operation, writes the result to DynamoDB, and returns.
This is the same DynamoDB table used for the saga log. Each record includes the idempotency key, the operation result, and a TTL for cleanup. Payment processors typically accept their own idempotency keys (Stripe uses the Idempotency-Key header), so pass the same order-step composite key to the processor as well.
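The check-then-act wrapper is a few lines. Here a dict stands in for the DynamoDB table; a real implementation would use a conditional write (for example, a PutItem that fails if the key already exists) to close the race between the check and the write:

```python
results_by_key = {}  # stands in for the DynamoDB saga/idempotency table

def idempotent(order_id, step_name, operation):
    """Run operation at most once per (order_id, step_name) key."""
    key = f"{order_id}#{step_name}"
    if key in results_by_key:
        return results_by_key[key]  # duplicate call: replay the stored result
    result = operation()            # first call: perform the mutation
    results_by_key[key] = result
    return result
```

The same composite key is what you forward to the payment processor as its idempotency key, so a retried charge is deduplicated on both sides.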
Express Workflow Deduplication
Express workflows execute at-least-once. If your parent Standard workflow invokes an Express child for cart validation, that child may run twice for the same input. Cart validation reads are naturally idempotent (reading inventory levels twice returns the same answer). Notification sends are not. Use SQS FIFO queues with message deduplication IDs for any Express workflow step that sends external communications.
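A deterministic MessageDeduplicationId is what makes the duplicate Express run harmless. Deriving it from the order and step (an illustrative helper, not a prescribed scheme) is enough:

```python
import hashlib

def dedup_id(order_id, step_name):
    # SQS FIFO drops any message that repeats a MessageDeduplicationId
    # within the 5-minute deduplication window, so a duplicate Express
    # run produces the same ID and its send is silently ignored.
    raw = f"{order_id}#{step_name}".encode()
    return hashlib.sha256(raw).hexdigest()  # 64 chars, within SQS's 128 limit
```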
Operational Lessons and Production Failure Modes
The 25,000 Event History Limit
Step Functions Standard workflows have a hard limit of 25,000 events in execution history. Each state entry, exit, retry, and data pass generates events. A fulfillment workflow with 15 states, 3 retries each, and rich input/output can burn through events faster than you expect. At event 24,999, the execution waits for one final event. If that event is ExecutionSucceeded, the workflow completes. If it is anything else, the workflow fails.
For long-running fulfillment workflows (physical delivery can take days), use child workflows for the fulfillment phase. Each child workflow gets its own 25,000-event budget. The parent workflow uses a single callback state that waits for the child to complete, consuming only 2 events in the parent's history.
The 256 KB Payload Limit
State input and output are limited to 256 KB. Cart data for orders with many line items, customization options, and metadata can exceed this. Store the full cart in DynamoDB or S3 and pass only the reference (order ID) through the workflow. Every Lambda function reads the current cart state from DynamoDB using the order ID. This also eliminates stale data problems: if a compensation step modifies the cart, the next step reads the updated version.
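The pass-by-reference pattern is mechanical. A sketch with a dict standing in for the DynamoDB table (function names are illustrative):

```python
import json

cart_store = {}  # stands in for the DynamoDB carts/orders table

def start_payload(order_id, cart):
    # Persist the full cart once; the workflow only ever carries the key,
    # so payload size stays far below the 256 KB limit.
    cart_store[order_id] = json.dumps(cart)
    return {"orderId": order_id}

def load_cart(payload):
    # Every Lambda rehydrates current cart state by reference, so an
    # edit made by a compensation step is visible to the next step.
    return json.loads(cart_store[payload["orderId"]])
```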
Execution Name Collisions
Standard workflow execution names must be unique within 90 days. Using the order ID as the execution name gives you natural deduplication (retrying StartExecution with the same name and input returns success without creating a duplicate). But if an order ID is reused within 90 days (test environments, retry after cancellation), the call fails. I append a version number to the execution name: order-12345-v1, order-12345-v2.
| Quota | Value | Type | Workaround |
|---|---|---|---|
| Execution history events | 25,000 | Hard | Child workflows for long-running phases |
| Max execution duration | 1 year | Hard | Sufficient for any fulfillment workflow |
| Payload size | 256 KB | Hard | Store data in DynamoDB, pass references |
| Open executions per account | 1,000,000 | Soft | Request increase for high-volume stores |
| State transitions per second | 800 | Soft | Request increase; use Express for bursts |
| Task token validity | 1 year | Hard | Set HeartbeatSeconds well below this |
| Execution name uniqueness | 90 days | Hard | Append version to execution name |
Key Patterns
After running variations of this architecture across three different commerce platforms, the patterns that consistently matter are:
- Use Standard workflows for the saga core. The exactly-once guarantee and execution history justify the cost for any workflow that touches payment and inventory.
- Use Express child workflows for validation and notification. Cart validation, stock checks, and email/SMS fanout are short-lived and idempotent. Express handles them at 25x lower cost per invocation.
- Use .waitForTaskToken for every external integration. Payment processors, warehouse systems, and shipping carriers are async. The callback pattern pauses for free. Polling wastes money and adds complexity.
- Map every forward step to a compensation. Write the compensation before writing the forward step. Test the compensation path as thoroughly as the happy path.
- Store saga state in DynamoDB, not in the workflow payload. Pass order IDs through the workflow. Let Lambda functions read current state from DynamoDB. This avoids the 256 KB limit and keeps data fresh across retries.
- Set HeartbeatSeconds on every callback state. Silent failures in external systems are the most dangerous failure mode. Heartbeat timeouts are your safety net.
- Make every Lambda idempotent. Retries will happen. Duplicates will happen. Use the order ID plus step name as the idempotency key everywhere.
Additional Resources
- AWS Step Functions Developer Guide
- Implement the Serverless Saga Pattern Using AWS Step Functions
- Saga Orchestration Pattern (AWS Prescriptive Guidance)
- Building Cost-Effective AWS Step Functions Workflows
- Integrating AWS Step Functions Callbacks and External Systems
- Handle Unpredictable Processing Times with Operational Consistency
- Step Functions Best Practices
- AWS Step Functions Service Quotas
- Building a Serverless Distributed Application Using a Saga Orchestration Pattern
Let's Build Something!
I help teams ship cloud infrastructure that actually works at scale. Whether you're modernizing a legacy platform, designing a multi-region architecture from scratch, or figuring out how AI fits into your engineering workflow, I've seen your problem before. Let me help.
Currently taking on select consulting engagements through Vantalect.

