
Step Functions for Cart and Fulfillment: Async Workflow Patterns That Survive Production

About the author: I'm Charles Sieg, a cloud architect and platform engineer who builds apps, services, and infrastructure for Fortune 1000 clients through Vantalect. If your organization is rethinking its software strategy in the age of AI-assisted engineering, let's talk.

Every e-commerce team starts with a synchronous checkout. The API receives a cart, charges the card, decrements inventory, and returns a confirmation. It works until it doesn't. Payment processors time out. Warehouses operate on batch cycles. Inventory reservations race against each other across regions. I have rebuilt checkout and fulfillment pipelines three times across different organizations, and every rebuild ended at the same place: an asynchronous state machine with compensating transactions. AWS Step Functions is the right tool for this job, and this article covers the specific patterns, cost math, and operational lessons from running cart-to-delivery workflows in production.

If you want the general Step Functions architecture reference, see AWS Step Functions: An Architecture Deep-Dive. This article focuses on a single domain: the async cart and product fulfillment workflow, from the moment a customer clicks "Place Order" to the moment the shipment confirmation lands in their inbox.

Why Cart and Fulfillment Demands a State Machine

A checkout workflow looks simple on a whiteboard. Five or six boxes connected by arrows. In production, those boxes hide serious complexity.

The Sequential Dependency Problem

Cart operations involve steps that depend on each other and steps that must undo themselves when downstream operations fail. Reserve inventory before charging the card. Charge the card before committing the order. Generate a shipping label before notifying the warehouse. If shipping label generation fails, you need to refund the card and release the inventory reservation. This chain of forward operations and compensating rollbacks is the saga pattern, and it maps directly to a Step Functions state machine.

Without a state machine, teams scatter this logic across Lambda functions, SQS consumers, and application code. Each component knows about the next component in the chain. Each one handles its own retries and manages its own state. When the payment service fails halfway through, nobody knows whether inventory was already reserved. Was the customer charged? Did the warehouse receive a pick request? Debugging requires correlating logs across six services, and the answer is usually "it depends on the timing."

What the State Machine Gives You

Step Functions centralizes the orchestration. Each service does its job and returns a result. The state machine decides what happens next, what to retry, and what to undo. Three properties make this valuable for commerce workflows:

| Property | Without Step Functions | With Step Functions |
|---|---|---|
| State visibility | Correlate logs across services | Single execution view shows every step |
| Compensation logic | Scattered across consumers | Centralized in Catch blocks |
| Retry policy | Hardcoded per service | Declarative per state with exponential backoff |
| Execution guarantee | At-least-once with manual dedup | Exactly-once for Standard workflows |
| Timeout handling | Custom per service | HeartbeatSeconds and TimeoutSeconds per state |

Anatomy of the Cart-to-Delivery Workflow

The workflow spans three phases: cart finalization, payment and reservation, and physical fulfillment. Each phase has different latency characteristics, failure modes, and cost profiles.

Phase 1: Cart Validation and Lock

When the customer clicks "Place Order," the workflow starts. The first states validate the cart contents against current catalog data, check inventory availability, and place a soft lock on the reserved items. This phase runs in milliseconds. Every step is a direct DynamoDB read or conditional write.
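These Phase 1 states can call DynamoDB directly through the Step Functions service integration, with no Lambda in between. Here is a sketch of the soft-lock state; the `inventory` table name, `available` attribute, and input shape are assumptions for illustration (note that DynamoDB's `N` type expects a numeric string, so `$.quantity` must arrive as a string):

```json
{
  "ReserveInventory": {
    "Type": "Task",
    "Resource": "arn:aws:states:::dynamodb:updateItem",
    "Parameters": {
      "TableName": "inventory",
      "Key": { "sku": { "S.$": "$.sku" } },
      "UpdateExpression": "SET available = available - :q",
      "ConditionExpression": "available >= :q",
      "ExpressionAttributeValues": { ":q": { "N.$": "$.quantity" } }
    },
    "ResultPath": "$.reservation",
    "Next": "AuthorizePayment"
  }
}
```

If the condition fails, the state throws `DynamoDB.ConditionalCheckFailedException`, which a Catch block can route straight to the out-of-stock branch.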

Phase 2: Payment and Order Commitment

Payment authorization is the first truly async operation. The workflow sends a payment request to the processor and waits for a callback. This is where the .waitForTaskToken pattern earns its place: Step Functions pauses execution, the payment processor does its work (fraud checks, bank authorization, 3DS challenges), and when finished, it calls back with the task token to resume the workflow. No polling. No wasted compute. The execution can pause for minutes or hours at zero cost.

After payment confirmation, the workflow commits the order: finalizes inventory deductions, writes the order record, and publishes an OrderPlaced event for downstream systems.

Phase 3: Fulfillment Pipeline

Physical fulfillment operates on a different timescale. Warehouse management systems process pick lists in batches. Shipping carriers generate labels asynchronously. Tracking numbers propagate through carrier APIs with variable delay. This phase uses callback patterns heavily, with each integration point pausing the workflow until an external system signals completion.

```mermaid
flowchart TD
    A[Place Order Received] --> B[Validate Cart Items]
    B --> C{Items Available?}
    C -->|No| D[Notify Customer Out of Stock]
    C -->|Yes| E[Reserve Inventory]
    E --> F[Authorize Payment - waitForTaskToken]
    F --> G{Payment Approved?}
    G -->|No| H[Release Inventory]
    G -->|Yes| I[Commit Order to DynamoDB]
    I --> J[Generate Shipping Label]
    J --> K[Queue Warehouse Pick Request]
    K --> L[Wait for Shipment - waitForTaskToken]
    L --> M[Send Tracking to Customer]
    M --> N[Order Complete]
```

Complete cart-to-delivery workflow with async integration points

The Saga Pattern: Forward Steps and Compensating Transactions

The saga pattern is the backbone of any reliable commerce workflow. Each forward step in the workflow has a corresponding compensation step that undoes its work. When any step fails, the workflow executes compensations in reverse order to restore consistency.

Mapping Forward Steps to Compensations

Every forward operation that mutates state needs a compensation. Read-only steps (cart validation, inventory checks) do not. Here is the full mapping for a cart-to-delivery workflow:

| Forward Step | Mutation | Compensation Step | Compensation Action |
|---|---|---|---|
| Reserve Inventory | DynamoDB conditional decrement | Release Inventory | DynamoDB conditional increment |
| Authorize Payment | Payment processor hold | Void Authorization | Payment processor void |
| Commit Order | DynamoDB order record created | Cancel Order | Update order status to "cancelled" |
| Charge Payment | Payment processor capture | Refund Payment | Payment processor refund |
| Generate Shipping Label | Carrier API label created | Void Label | Carrier API void (if supported) |
| Queue Pick Request | SQS message to warehouse | Cancel Pick | SQS cancellation message |

Implementing Compensation in ASL

The saga compensation maps to Step Functions Catch blocks. Each task state specifies a Catch that routes to the appropriate compensation chain. The compensation chain runs in reverse order: if the "Charge Payment" step fails, the workflow runs "Cancel Order," then "Void Authorization," then "Release Inventory."

The critical detail is ResultPath. Set it to something like $.error in your Catch block so the error information gets appended to the state input rather than replacing it. Without this, the compensation steps lose access to the order data they need to perform rollbacks.
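As a sketch, a forward state with its Catch might look like the following; the function ARN and state names are illustrative, not from a real deployment:

```json
{
  "ChargePayment": {
    "Type": "Task",
    "Resource": "arn:aws:lambda:us-east-1:123456789012:function:charge-payment",
    "ResultPath": "$.chargeResult",
    "Catch": [
      {
        "ErrorEquals": ["States.ALL"],
        "ResultPath": "$.error",
        "Next": "CancelOrder"
      }
    ],
    "Next": "GenerateShippingLabel"
  }
}
```

Because the Catch sets `"ResultPath": "$.error"`, the CancelOrder state still receives `$.orderId` and the rest of the original input alongside the error details.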

```mermaid
flowchart TD
    A[Reserve Inventory] --> B[Authorize Payment]
    B --> C[Commit Order]
    C --> D[Charge Payment]
    D -->|Failure| E[Cancel Order]
    E --> F[Void Authorization]
    F --> G[Release Inventory]
    G --> H[Notify Customer of Failure]
    D -->|Success| I[Continue to Fulfillment]
```

Saga compensation flow for payment failure

DynamoDB as the Saga Log

Every saga needs a durable log that records which steps completed. DynamoDB is the natural choice on AWS. I use a single orders table with a composite key (orderId as partition key, stepName as sort key) where each completed step writes a record with its output data. Compensation steps query this table to know what to undo and to confirm they have the data needed for rollback.

This table also solves observability. When something goes wrong, I query the orders table for a given orderId and see exactly which steps completed, which failed, and what data each step produced. Combined with the Step Functions execution history, debugging becomes straightforward.

Note
Always write the saga log entry before performing the forward operation, then update it after completion. If the operation fails between the write and the update, the log entry with "in_progress" status tells the compensation step exactly what it needs to clean up.
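A minimal sketch of this write-before-execute pattern. The in-memory dict stands in for the DynamoDB orders table described above; in production these would be PutItem/UpdateItem calls keyed on (orderId, stepName), and the helper names are my own, not from the article:

```python
def log_step_start(log: dict, order_id: str, step: str) -> None:
    # Write the entry BEFORE running the forward operation, so a crash
    # mid-operation leaves an "in_progress" record for cleanup.
    log[(order_id, step)] = {"status": "in_progress", "output": None}

def log_step_complete(log: dict, order_id: str, step: str, output: dict) -> None:
    # Update the same record with the result once the operation succeeds.
    log[(order_id, step)] = {"status": "completed", "output": output}

def steps_to_compensate(log: dict, order_id: str) -> list:
    """Anything in_progress or completed needs cleanup, newest first."""
    entries = [(step, rec) for (oid, step), rec in log.items() if oid == order_id]
    return [step for step, rec in reversed(entries)
            if rec["status"] in ("in_progress", "completed")]
```

The reverse ordering is what gives the saga its "undo in the opposite order" behavior: compensation starts with the most recent mutation and walks back to the first.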

Standard vs. Express: Choosing Workflow Types for Each Phase

The cart-to-delivery workflow spans timeframes from milliseconds to days. Using a single Standard workflow for the entire pipeline wastes money on the fast phases and works perfectly for the slow phases. The right approach is a hybrid.

When Standard Workflows Earn Their Cost

Standard workflows cost $0.025 per 1,000 state transitions. They provide exactly-once execution semantics, up to one year of execution duration, and full execution history retrieval through the API. For the payment-and-commitment phase of a cart workflow, Standard is the only viable option because:

  • Payment callbacks can take minutes (3DS challenges, fraud review queues)
  • The saga pattern requires exactly-once guarantees to avoid double-charging
  • Execution history must be available for dispute resolution and audit trails
  • StartExecution is idempotent for Standard workflows when called with the same name and input, giving you natural deduplication

When Express Workflows Save Money

Express workflows cost $1.00 per million requests plus duration-based charges. They run for a maximum of 5 minutes and provide at-least-once execution semantics. For the cart validation phase (inventory checks, price verification, coupon validation), Express workflows handle the throughput at a fraction of the cost.

| Characteristic | Standard Workflow | Express Workflow |
|---|---|---|
| Max duration | 1 year | 5 minutes |
| Execution semantics | Exactly-once | At-least-once |
| Pricing model | $0.025 per 1,000 transitions | $1.00 per 1M requests + duration |
| State transition rate | 800/sec (soft) | 100,000/sec |
| Execution history | Full API retrieval, 90-day retention | CloudWatch Logs only |
| Idempotent start | Yes (same name + input) | No |
| Use in cart workflow | Payment, saga, fulfillment | Cart validation, notifications |

The Hybrid Architecture

I structure the cart workflow as a Standard parent workflow that invokes Express child workflows for high-throughput, short-lived operations. The parent manages the saga lifecycle. The children handle validation, notification fanout, and other idempotent bursts.

A 10,000-order-per-day storefront with an average of 12 state transitions per order in the Standard workflow and 8 transitions in Express child workflows costs roughly:

| Component | Calculation | Monthly Cost |
|---|---|---|
| Standard parent | 300K orders x 12 transitions = 3.6M transitions | $90.00 |
| Express children | 300K orders x 1 invocation = 300K requests + ~150ms avg at 64 MB | ~$2.50 |
| Lambda invocations | ~15 per order x 300K = 4.5M invocations | ~$2.70 |
| DynamoDB (saga log) | ~10 writes + 5 reads per order | ~$4.50 |
| Total orchestration | | ~$99.70 |

That is $100/month to orchestrate 300,000 orders with full saga guarantees, execution history, and callback-based async integration. The alternative (building this in application code with SQS and custom state management) costs more in engineering time than Step Functions will ever cost in AWS bills.

The Callback Pattern for Async Integration

The .waitForTaskToken callback pattern is the single most important integration pattern for fulfillment workflows. External systems (payment processors, warehouse management, shipping carriers) operate asynchronously. The callback pattern lets Step Functions pause without consuming resources and resume when the external system finishes.

How waitForTaskToken Works

When a task state uses the .waitForTaskToken suffix in its Resource ARN, Step Functions generates a unique task token and includes it in the task input. The task can pass this token to an external system (via SQS, SNS, Lambda, or API Gateway). The execution pauses. When the external system finishes, it calls SendTaskSuccess or SendTaskFailure with the token to resume execution.

The execution can wait for up to one year at zero additional cost. No polling loops. No idle compute. You pay the normal state transition charges for the callback state itself; nothing accrues while the execution is paused.

Payment Authorization Example

For payment authorization, the workflow sends a message to an SQS queue that the payment service consumes. The message includes the task token, order details, and payment method. The payment service processes the authorization (which may involve external 3DS redirects, fraud scoring, and bank communication) and calls SendTaskSuccess with the authorization code or SendTaskFailure with the decline reason.
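The resume side can be sketched like this. The `token_table` dict stands in for the DynamoDB job-id-to-token mapping described below, and `sfn_client` would be `boto3.client("stepfunctions")` in production; the function and field names are illustrative:

```python
import json

def complete_payment_callback(sfn_client, token_table, job_id, approved, detail):
    """Resume a paused execution once the payment processor finishes."""
    # Look up the task token stored when the authorization request was queued.
    token = token_table[job_id]
    if approved:
        # Resume the workflow; `detail` becomes the callback state's output.
        sfn_client.send_task_success(taskToken=token, output=json.dumps(detail))
    else:
        # Fail the callback state; the Catch block routes to compensation.
        sfn_client.send_task_failure(
            taskToken=token,
            error="PaymentDeclined",
            cause=detail.get("reason", "unknown"),
        )
```

`send_task_success` and `send_task_failure` are the real Step Functions API calls; everything around them here is a stand-in for your payment service's internals.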

| Integration Point | Transport | Token Storage | Typical Wait Time |
|---|---|---|---|
| Payment authorization | SQS queue | DynamoDB mapping table | 2 seconds to 30 minutes |
| Fraud review (manual) | SQS + human workflow | DynamoDB mapping table | 1 hour to 48 hours |
| Warehouse pick/pack | SQS to WMS | DynamoDB mapping table | 15 minutes to 4 hours |
| Shipping label generation | Direct SDK integration | In-memory (fast) | 1 to 5 seconds |
| Carrier pickup scan | EventBridge from webhook | DynamoDB mapping table | 2 hours to 24 hours |

Storing Task Tokens

The external system needs to map its internal job ID back to the Step Functions task token. I store this mapping in a DynamoDB table with the external job ID as the partition key and the task token as an attribute. When the external system completes, it looks up the task token by its job ID and calls SendTaskSuccess. Simple, durable, and fast.

Set HeartbeatSeconds on every callback state. If the external system silently fails (crashes, loses the message, drops the request), the heartbeat timeout catches it. Without this, the execution hangs indefinitely. I use heartbeat values of 2x the expected processing time: if payment authorization normally takes 10 seconds, set the heartbeat to 20 seconds. If the warehouse typically picks within 2 hours, set it to 4 hours.
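Putting the pieces together, a sketch of the payment callback state; the queue URL is a placeholder, and the 30-minute heartbeat assumes the 2x rule applied to the upper end of normal authorization time. A heartbeat expiry surfaces as `States.Timeout`, which the Catch routes to a verification step rather than blind compensation:

```json
{
  "AuthorizePayment": {
    "Type": "Task",
    "Resource": "arn:aws:states:::sqs:sendMessage.waitForTaskToken",
    "Parameters": {
      "QueueUrl": "https://sqs.us-east-1.amazonaws.com/123456789012/payment-requests",
      "MessageBody": {
        "taskToken.$": "$$.Task.Token",
        "orderId.$": "$.orderId",
        "amount.$": "$.cart.total"
      }
    },
    "HeartbeatSeconds": 1800,
    "ResultPath": "$.authorization",
    "Catch": [
      {
        "ErrorEquals": ["States.Timeout"],
        "ResultPath": "$.error",
        "Next": "VerifyPaymentStatus"
      }
    ],
    "Next": "CommitOrder"
  }
}
```

The `$$.Task.Token` context-object reference is how the token reaches the message body; the payment consumer carries it back in its SendTaskSuccess call.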

Error Handling and Retry Strategy

Commerce workflows have zero tolerance for silent failures. A lost order is a lost customer. The retry and catch mechanics in Step Functions handle both transient glitches and permanent failures, but getting the configuration right requires understanding the failure taxonomy.

Transient vs. Permanent Failures

| Failure Type | Examples | Strategy | Step Functions Mechanism |
|---|---|---|---|
| Transient | Lambda throttle, DynamoDB provisioned throughput exceeded, network timeout | Retry with backoff | Retry with IntervalSeconds, BackoffRate, MaxAttempts |
| Permanent | Payment declined, item discontinued, address undeliverable | Compensate and notify | Catch block routing to compensation chain |
| Ambiguous | Payment processor timeout (charged or not?) | Query then decide | Lambda that checks payment status before compensating |
| Infrastructure | Service outage, region degradation | Wait and retry longer | Retry with high MaxAttempts and long MaxDelaySeconds |

The ambiguous category is the dangerous one. A payment processor timeout does not tell you whether the charge went through. Blindly retrying may double-charge the customer. Blindly compensating may void a successful charge. The correct pattern is a verification Lambda that queries the payment processor for the transaction status before deciding whether to retry or compensate.
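The decision logic of that verification Lambda can be reduced to a small pure function. The status names here are assumptions; map them to whatever your processor's status API actually returns:

```python
def resolve_ambiguous_charge(processor_status: str) -> str:
    """Decide the next workflow action after a payment timeout,
    based on what the processor's status API reports."""
    if processor_status == "captured":
        return "continue"    # charge went through: do NOT retry or refund
    if processor_status == "failed":
        return "retry"       # charge never landed: safe to attempt again
    if processor_status == "pending":
        return "wait"        # still in flight: re-check after a delay
    return "compensate"      # unknown state: void/refund and notify
```

A Choice state downstream of the verification Lambda branches on this result, so neither a double charge nor a voided successful charge can happen from a timeout alone.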

Retry Configuration for Commerce

My standard retry configuration for commerce workflows:

```json
{
  "Retry": [
    {
      "ErrorEquals": ["Lambda.ServiceException", "Lambda.TooManyRequestsException"],
      "IntervalSeconds": 2,
      "MaxAttempts": 6,
      "BackoffRate": 2.0,
      "MaxDelaySeconds": 60
    },
    {
      "ErrorEquals": ["States.TaskFailed"],
      "IntervalSeconds": 5,
      "MaxAttempts": 3,
      "BackoffRate": 2.0
    }
  ],
  "Catch": [
    {
      "ErrorEquals": ["States.ALL"],
      "ResultPath": "$.error",
      "Next": "CompensationChain"
    }
  ]
}
```

The first retrier handles AWS infrastructure transience (Lambda cold starts failing, throttles, transient service errors) with aggressive retry. The second retrier handles application-level task failures with fewer attempts. The Catch block captures anything that survives both retriers and routes to compensation.

Dead Letter Queues for Callback Failures

When an external system never calls back and the heartbeat expires, the execution fails. Route these failures to an SQS dead letter queue through the compensation chain. A separate process monitors the DLQ and triggers investigation workflows (check the external system, verify the order state, alert the operations team). I have seen warehouse integrations silently drop callback messages when their SQS consumer crashes during deployment. The heartbeat timeout caught it within 30 minutes instead of discovering it the next morning through customer complaints.

Idempotency: The Non-Negotiable Requirement

Every Lambda function in a fulfillment workflow must be idempotent. Step Functions Standard workflows provide exactly-once state transitions, but the services they invoke may receive duplicate calls during retries. If "Charge Payment" retries after a timeout, the Lambda must check whether the charge already completed before attempting it again.

Idempotency Keys

Use the order ID combined with the step name as an idempotency key. Before performing any mutation, the Lambda checks DynamoDB for a record matching that key. If found, it returns the stored result. If not found, it performs the operation, writes the result to DynamoDB, and returns.

This is the same DynamoDB table used for the saga log. Each record includes the idempotency key, the operation result, and a TTL for cleanup. Payment processors typically accept their own idempotency keys (Stripe uses the Idempotency-Key header), so pass the same order-step composite key to the processor as well.
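The check-then-execute shape looks like this. The dict stands in for the DynamoDB table; in production the check and write collapse into a single conditional PutItem so concurrent duplicates cannot race, and the helper names are illustrative:

```python
def idempotency_key(order_id: str, step: str) -> str:
    # Order ID plus step name uniquely identifies one mutation attempt.
    return f"{order_id}#{step}"

def run_once(store: dict, order_id: str, step: str, operation):
    """Execute `operation` at most once per (order, step); replay the
    stored result on any duplicate invocation."""
    key = idempotency_key(order_id, step)
    if key in store:
        return store[key]      # duplicate call: return the recorded result
    result = operation()       # first call: perform the real mutation
    store[key] = result        # record it so retries become no-ops
    return result
```

The same composite key goes to the payment processor as its idempotency key (e.g. Stripe's Idempotency-Key header), so duplicates are deduplicated on both sides of the integration.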

Express Workflow Deduplication

Express workflows execute at-least-once. If your parent Standard workflow invokes an Express child for cart validation, that child may run twice for the same input. Cart validation reads are naturally idempotent (reading inventory levels twice returns the same answer). Notification sends are not. Use SQS FIFO queues with message deduplication IDs for any Express workflow step that sends external communications.
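A sketch of the FIFO send parameters; the queue URL is a placeholder, and the key fields are the group ID (per-order ordering) and the deduplication ID (collapsing at-least-once re-runs within SQS's five-minute dedup window):

```python
def build_notification_message(queue_url: str, order_id: str,
                               step: str, body: str) -> dict:
    """Parameters for sqs.send_message on a FIFO queue. A duplicate
    Express execution produces the same dedup ID, so SQS delivers
    the notification exactly once."""
    return {
        "QueueUrl": queue_url,
        "MessageBody": body,
        "MessageGroupId": order_id,                      # per-order ordering
        "MessageDeduplicationId": f"{order_id}#{step}",  # 5-minute dedup window
    }

# Usage: sqs.send_message(**build_notification_message(
#     "https://sqs.us-east-1.amazonaws.com/123456789012/notify.fifo",
#     "order-12345", "SendConfirmationEmail", "{...}"))
```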

Operational Lessons and Production Failure Modes

The 25,000 Event History Limit

Step Functions Standard workflows have a hard limit of 25,000 events in execution history. Each state entry, exit, retry, and data pass generates events. A fulfillment workflow with 15 states, 3 retries each, and rich input/output can burn through events faster than you expect. At event 24,999, the execution waits for one final event. If that event is ExecutionSucceeded, the workflow completes. If it is anything else, the workflow fails.

For long-running fulfillment workflows (physical delivery can take days), use child workflows for the fulfillment phase. Each child workflow gets its own 25,000-event budget. The parent workflow uses a single callback state that waits for the child to complete, consuming only 2 events in the parent's history.

The 256 KB Payload Limit

State input and output are limited to 256 KB. Cart data for orders with many line items, customization options, and metadata can exceed this. Store the full cart in DynamoDB or S3 and pass only the reference (order ID) through the workflow. Every Lambda function reads the current cart state from DynamoDB using the order ID. This also eliminates stale data problems: if a compensation step modifies the cart, the next step reads the updated version.

Execution Name Collisions

Standard workflow execution names must be unique within 90 days. Using the order ID as the execution name gives you natural deduplication (retrying StartExecution with the same name and input returns success without creating a duplicate). But if an order ID is reused within 90 days (test environments, retry after cancellation), the call fails. I append a version number to the execution name: order-12345-v1, order-12345-v2.

| Quota | Value | Type | Workaround |
|---|---|---|---|
| Execution history events | 25,000 | Hard | Child workflows for long-running phases |
| Max execution duration | 1 year | Hard | Sufficient for any fulfillment workflow |
| Payload size | 256 KB | Hard | Store data in DynamoDB, pass references |
| Open executions per account | 1,000,000 | Soft | Request increase for high-volume stores |
| State transitions per second | 800 | Soft | Request increase; use Express for bursts |
| Task token validity | 1 year | Hard | Set HeartbeatSeconds well below this |
| Execution name uniqueness | 90 days | Hard | Append version to execution name |

Key Patterns

After running variations of this architecture across three different commerce platforms, the patterns that consistently matter are:

  1. Use Standard workflows for the saga core. The exactly-once guarantee and execution history justify the cost for any workflow that touches payment and inventory.
  2. Use Express child workflows for validation and notification. Cart validation, stock checks, and email/SMS fanout are short-lived and idempotent. Express handles them at 25x lower cost per invocation.
  3. Use .waitForTaskToken for every external integration. Payment processors, warehouse systems, and shipping carriers are async. The callback pattern pauses for free. Polling wastes money and adds complexity.
  4. Map every forward step to a compensation. Write the compensation before writing the forward step. Test the compensation path as thoroughly as the happy path.
  5. Store saga state in DynamoDB, not in the workflow payload. Pass order IDs through the workflow. Let Lambda functions read current state from DynamoDB. This avoids the 256 KB limit and keeps data fresh across retries.
  6. Set HeartbeatSeconds on every callback state. Silent failures in external systems are the most dangerous failure mode. Heartbeat timeouts are your safety net.
  7. Make every Lambda idempotent. Retries will happen. Duplicates will happen. Use the order ID plus step name as the idempotency key everywhere.

Let's Build Something!

I help teams ship cloud infrastructure that actually works at scale. Whether you're modernizing a legacy platform, designing a multi-region architecture from scratch, or figuring out how AI fits into your engineering workflow, I've seen your problem before. Let me help.

Currently taking on select consulting engagements through Vantalect.