
Building Large-Scale SageMaker Training Pipelines with Step Functions

About the author: I'm Charles Sieg, a cloud architect and platform engineer who builds apps, services, and infrastructure for Fortune 1000 clients through Vantalect. If your organization is rethinking its software strategy in the age of AI-assisted engineering, let's talk.

I have spent the last several months orchestrating ML training pipelines that coordinate dozens of SageMaker jobs: preprocessing, feature engineering, distributed training, hyperparameter tuning, evaluation, conditional deployment. The pattern I keep seeing is that teams pour effort into model architecture and training code while treating the orchestration layer as an afterthought. Then the orchestration layer is exactly where the ugliest production failures happen. This article is my architecture reference for building training pipelines on AWS Step Functions at scale. If you have already read my AWS Step Functions: An Architecture Deep-Dive, the execution model and state types will be familiar. Here we get into the problems specific to ML pipelines: training jobs that run for hours, spot instances that vanish mid-epoch, models that need human sign-off before they touch production traffic, and the retraining loops that keep everything from going stale.

Why Step Functions for ML Pipelines

You have to pick an orchestrator before you build anything else, and this choice will shape every decision downstream. I have built ML pipelines on all the major AWS options. They overlap in feature lists enough to make the decision confusing, but once you operate them in production, the differences become obvious fast.

| Capability | Step Functions | SageMaker Pipelines | MWAA (Airflow) | Custom Lambda |
|---|---|---|---|---|
| Native SageMaker integration | Service integration (optimized) | Native (built-in) | Via boto3 operators | Via SDK calls |
| Visual debugging | Excellent: per-execution graph | Good: pipeline DAG | Good: DAG view | None |
| Max execution duration | 1 year (Standard) | No hard limit | No hard limit | 15 minutes per invocation |
| Branching/conditional logic | Choice states, native | Conditions, limited | Python branching | Custom code |
| Parallel execution | Map state, Parallel state | Parallelism steps | Task parallelism | Fan-out pattern |
| Human approval gates | Callback pattern, native | Not native | Manual sensors | Custom implementation |
| Cost model | Per state transition | Free (pay for jobs) | Hourly environment | Per invocation |
| Error handling | Retry/Catch per state | Retry policies | Task retries | Try/except |
| Nested pipelines | Native (nested executions) | Not supported | SubDAGs/TaskGroups | Recursive invocation |
| Event-driven triggers | EventBridge native | EventBridge/Schedule | External trigger only | EventBridge native |

My recommendation: Use Step Functions when your pipeline has conditional branching, human approval gates, or needs to reach beyond SageMaker into other AWS services. SageMaker Pipelines wins when your workflow is a straightforward DAG touching only SageMaker and you want zero orchestration cost. MWAA makes sense if your team already lives in Airflow or your pipeline spans multiple clouds.

The pipelines in this article are multi-stage, conditional, parallel, and include human approval. Step Functions wins handily.

Pipeline Architecture Overview

A production training pipeline is so much more than "call SageMaker and wait." You are building a multi-stage workflow that takes raw data and turns it into a deployed, monitored model. Every stage has to handle failures cleanly, support partial reruns (because you will need them), and produce auditable artifacts.

```mermaid
flowchart TD
    A[S3 Data Landing] --> B[Data Validation]
    B --> C{Data Quality<br/>Gate}
    C -->|Pass| D[Feature Engineering]
    C -->|Fail| E[Alert & Halt]
    D --> F[Feature Store]
    F --> G[Training Jobs]
    G --> H[Model Evaluation]
    H --> I{Meets<br/>Baseline?}
    I -->|Exceeds| J[Register Model]
    I -->|Below| K[Retrain with<br/>New Hyperparameters]
    I -->|Far Below| L[Alert & Investigate]
    K --> G
    J --> M{Approval<br/>Gate}
    M -->|Approved| N[Deploy to Endpoint]
    M -->|Rejected| L
    N --> O[Smoke Tests]
    O --> P[Production Traffic]
    P --> Q[Model Monitoring]
    Q --> R{Drift<br/>Detected?}
    R -->|Yes| A
    R -->|No| Q
```
*Multi-stage training pipeline architecture*

Each box maps to one or more Step Functions states. Notice the cycle: model monitoring feeds back into data ingestion, forming an automated retraining loop. A single Step Functions execution handles one pass through the pipeline. When monitoring detects drift, a separate EventBridge rule kicks off a fresh execution.

Data Preparation Stage

In my experience, data preparation is where pipelines break most often. Raw data shows up with schema drift, missing values, unexpected cardinality, weird encoding. If you let bad data slide through, you burn GPU hours training on garbage. The data preparation sub-workflow exists to catch problems before they get expensive.

```mermaid
flowchart LR
    A[S3 Raw Data] --> B[Schema<br/>Validation]
    B --> C{Schema<br/>Valid?}
    C -->|Yes| D[SageMaker<br/>Processing Job]
    C -->|No| E[Dead Letter<br/>& Alert]
    D --> F[Feature<br/>Engineering]
    F --> G[Statistical<br/>Validation]
    G --> H{Distribution<br/>Shift?}
    H -->|No| I[Write to<br/>Feature Store]
    H -->|Yes| J[Flag for<br/>Review]
    I --> K[Training<br/>Dataset Ready]
    J --> K
```
*Data preparation sub-workflow*

Schema Validation

The first state runs a lightweight Lambda function that validates incoming data against an expected schema. At a few cents per invocation, it catches the most common data problems before they trigger a Processing job that actually costs money.

I check for:

  • Column presence and types: All expected features present, numeric columns actually numeric
  • Cardinality bounds: Catches the case where a categorical column that should have 50 values suddenly has 50,000
  • Null ratios: Flags when the percentage of nulls in a critical column blows past acceptable thresholds
  • File format and encoding: Confirms Parquet/CSV format with correct encoding
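The checks above can be sketched as a small validation function. The expected-schema structure, column names, and thresholds here are illustrative, not the exact contract my production Lambda enforces:

```python
# Sketch of the schema-validation checks. EXPECTED_SCHEMA's shape,
# column names, and thresholds are illustrative.
EXPECTED_SCHEMA = {
    "columns": {"user_id": int, "region": str, "spend_30d": float},
    "max_cardinality": {"region": 50},
    "max_null_ratio": {"spend_30d": 0.05},
}

def validate_batch(rows, schema=EXPECTED_SCHEMA):
    """Return a list of human-readable violations; an empty list means pass."""
    errors = []

    # Column presence and types: every expected column exists with the right type.
    for col, col_type in schema["columns"].items():
        missing = sum(1 for r in rows if col not in r)
        if missing:
            errors.append(f"{col}: missing from {missing} records")
            continue
        bad_type = sum(
            1 for r in rows if r[col] is not None and not isinstance(r[col], col_type)
        )
        if bad_type:
            errors.append(f"{col}: {bad_type} records with wrong type")

    # Cardinality bounds: catch a categorical column exploding in distinct values.
    for col, bound in schema["max_cardinality"].items():
        distinct = {r.get(col) for r in rows} - {None}
        if len(distinct) > bound:
            errors.append(f"{col}: cardinality {len(distinct)} exceeds bound {bound}")

    # Null ratios: flag critical columns whose null fraction blows past threshold.
    for col, max_ratio in schema["max_null_ratio"].items():
        ratio = sum(1 for r in rows if r.get(col) is None) / max(len(rows), 1)
        if ratio > max_ratio:
            errors.append(f"{col}: null ratio {ratio:.2f} exceeds {max_ratio}")

    return errors
```

The Lambda returns the violation list; a nonempty list routes the execution to the dead-letter branch.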

SageMaker Processing Jobs

I use SageMaker Processing jobs with custom containers for feature engineering. I tried the built-in processors early on and kept hitting walls. Here is why:

| Approach | Pros | Cons |
|---|---|---|
| Built-in SKLearnProcessor | Quick setup, managed container | Limited library versions, constrained customization |
| Built-in SparkProcessor | Distributed processing, large datasets | Heavy startup time (~5 min), overkill for most jobs |
| Custom container | Full control, reproducible, any library | Must maintain container, ECR costs |
| Lambda | Fast startup, cheap | 15-min limit, 10 GB memory limit, no GPU |

Under 50 GB of data? A custom container on ml.m5.4xlarge handles feature engineering in minutes. Above 50 GB, SparkProcessor on a cluster becomes worthwhile despite the startup penalty.

Statistical Validation

After feature engineering, I run a statistical validation step that compares transformed feature distributions against a baseline stored in S3. Schema validation will not catch a feature that remains numeric but whose mean has shifted by three standard deviations. This step will. I store distribution baselines as JSON artifacts and run a Kolmogorov-Smirnov test with configurable significance thresholds. It has saved me from deploying models trained on silently corrupted data more than once.
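Here is a pure-Python sketch of that check: a two-sample Kolmogorov-Smirnov statistic compared against an approximate large-sample critical value. In practice you would reach for scipy.stats.ks_2samp; the significance handling here is simplified for illustration:

```python
# Sketch of the distribution-shift check: two-sample KS statistic
# plus an approximate large-sample critical value. Simplified relative
# to the scipy-based version used in the real pipeline.
import math

def ks_statistic(baseline, current):
    """Max distance between the two empirical CDFs."""
    a, b = sorted(baseline), sorted(current)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        while i < len(a) and a[i] == x:  # advance past ties on both sides
            i += 1
        while j < len(b) and b[j] == x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d

def distribution_shifted(baseline, current, alpha=0.05):
    """True if the KS statistic exceeds the large-sample critical value."""
    n, m = len(baseline), len(current)
    crit = math.sqrt(-0.5 * math.log(alpha / 2)) * math.sqrt((n + m) / (n * m))
    return ks_statistic(baseline, current) > crit
```

A shifted feature trips the flag and routes the execution to the review branch instead of the Feature Store write.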

Training Stage

This is where the real money gets spent. Training eats more compute than every other stage combined, which means cost optimization here has the most leverage by far. Getting this stage right means combining distributed training across multiple instances, spot instances for cost savings, checkpointing for recovery, and optionally hyperparameter tuning.

Instance Selection

Instance selection is a cost-performance tradeoff driven by your model architecture, dataset size, and how long you are willing to wait.

| Instance Type | vCPUs | GPU | GPU Memory | On-Demand $/hr | Spot $/hr (typical) | Best For |
|---|---|---|---|---|---|---|
| ml.m5.xlarge | 4 | None | N/A | $0.23 | $0.07 | Sklearn, XGBoost (small) |
| ml.m5.4xlarge | 16 | None | N/A | $0.92 | $0.28 | XGBoost, feature-heavy tabular |
| ml.g4dn.xlarge | 4 | 1x T4 | 16 GB | $0.74 | $0.22 | Fine-tuning small models, inference testing |
| ml.g5.2xlarge | 8 | 1x A10G | 24 GB | $1.52 | $0.46 | Medium transformer fine-tuning |
| ml.p3.2xlarge | 8 | 1x V100 | 16 GB | $3.83 | $1.15 | Large model training, mixed precision |
| ml.p3.8xlarge | 32 | 4x V100 | 64 GB | $14.69 | $4.41 | Distributed training, large batch |
| ml.p4d.24xlarge | 96 | 8x A100 | 320 GB | $37.69 | $11.31 | Foundation model fine-tuning |
| ml.trn1.32xlarge | 128 | 16x Trainium | 512 GB | $24.78 | $7.43 | Large-scale training (Trainium-compatible) |

My rule of thumb: start with the smallest GPU instance that fits your model in memory, then scale horizontally only when single-instance training blows your time budget. Vertical scaling (bigger instance) is always simpler to operate than horizontal scaling (distributed training). You avoid the inter-node communication overhead entirely. I have watched teams jump straight to multi-node training and spend weeks debugging NCCL timeouts when a single larger instance would have been fine.

Spot Instance Strategy

Spot instances cut training costs by 60-70%. The catch: they can disappear at any moment. I handle this in the Step Functions state machine with a layered retry strategy:

  1. First attempt: Spot instance, checkpointing every 10 minutes
  2. On interruption: Retry on spot with checkpoint resume (most interruptions are transient)
  3. After 3 spot failures: Fall back to on-demand with checkpoint resume
  4. On-demand failure: Alert and halt. Something is fundamentally wrong at this point.

The retry configuration for the training state looks like this conceptually:

```json
{
  "Retry": [
    {
      "ErrorEquals": ["SageMaker.SpotInterruption"],
      "IntervalSeconds": 60,
      "MaxAttempts": 3,
      "BackoffRate": 1.5
    },
    {
      "ErrorEquals": ["SageMaker.ResourceLimitExceeded"],
      "IntervalSeconds": 300,
      "MaxAttempts": 2,
      "BackoffRate": 2.0
    }
  ],
  "Catch": [
    {
      "ErrorEquals": ["States.ALL"],
      "Next": "TrainingFailureHandler"
    }
  ]
}
```

Checkpoint Management

Checkpoints make spot instances viable. Without them, a spot interruption after 6 hours of training means 6 wasted hours. Period. With checkpoints saved every 10 minutes to S3, you lose 10 minutes of work at most. That tradeoff is worth a little S3 storage cost.

I organize checkpoints around three S3 paths:

| Path | Purpose | Lifecycle |
|---|---|---|
| `s3://bucket/pipeline/{execution-id}/checkpoints/` | Active checkpoints for current training run | Deleted after successful training |
| `s3://bucket/pipeline/{execution-id}/model/` | Final model artifacts | Retained per model registry policy |
| `s3://bucket/pipeline/{execution-id}/logs/` | Training logs and metrics | Retained for 90 days |

The Step Functions execution ID flows through every SageMaker job as both a tag and an S3 path component. This gives you full traceability from any pipeline execution to the exact model artifact it produced. When something goes wrong six months later, you will be glad you wired this up.
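Inside the training script, the checkpoint convention is simple: write periodically to a local directory that SageMaker syncs to the checkpoint S3 URI, and on (re)start resume from the newest file present. A minimal sketch, with the file naming and state payload as illustrative choices:

```python
# Checkpoint save/resume convention inside the training script.
# SageMaker syncs the local checkpoint directory to the S3 URI passed
# as checkpoint_s3_uri, so the script only touches local files.
# The /opt/ml/checkpoints path is the conventional default; file
# naming and the state payload here are illustrative.
import json
import os

CHECKPOINT_DIR = "/opt/ml/checkpoints"

def save_checkpoint(epoch, state, ckpt_dir=CHECKPOINT_DIR):
    """Write one checkpoint file; zero-padded names keep sort order correct."""
    os.makedirs(ckpt_dir, exist_ok=True)
    path = os.path.join(ckpt_dir, f"epoch-{epoch:05d}.json")
    with open(path, "w") as f:
        json.dump({"epoch": epoch, "state": state}, f)
    return path

def latest_checkpoint(ckpt_dir=CHECKPOINT_DIR):
    """On (re)start, resume from the newest checkpoint if one exists."""
    if not os.path.isdir(ckpt_dir):
        return None
    names = sorted(n for n in os.listdir(ckpt_dir) if n.startswith("epoch-"))
    if not names:
        return None
    with open(os.path.join(ckpt_dir, names[-1])) as f:
        return json.load(f)
```

A spot-interrupted retry calls `latest_checkpoint()` first and skips straight to the surviving epoch.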

Distributed Training

Sometimes a single instance just cannot finish training in the time you have. That is when you reach for distributed training. SageMaker supports two strategies:

| Strategy | How It Works | Best For | Overhead |
|---|---|---|---|
| Data Parallel | Each instance trains on a data shard, gradients synchronized | Large datasets, models that fit on one GPU | Low: gradient sync only |
| Model Parallel | Model layers split across instances | Models too large for single GPU memory | High: inter-node activations |

For data parallelism, SageMaker's built-in data parallel library handles gradient synchronization automatically. The two things you need to get right are instance count and per-instance batch size. Scale the learning rate linearly with total batch size (the "linear scaling rule") and include a warmup period. Skip the warmup and you will watch your loss explode in the first few steps. I learned this the hard way.
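The linear scaling rule and warmup can be sketched as two small functions. The base values are illustrative, but the relationships are the point: learning rate proportional to total batch size, and a linear ramp over the warmup steps:

```python
# Linear scaling rule plus linear warmup, sketched as a schedule.
# Base values are illustrative.

def scaled_lr(base_lr, base_batch, per_instance_batch, num_instances):
    """Linear scaling rule: lr grows with total effective batch size."""
    total_batch = per_instance_batch * num_instances
    return base_lr * total_batch / base_batch

def lr_at_step(step, warmup_steps, target_lr):
    """Linear warmup from ~0 to target_lr, then constant."""
    if step < warmup_steps:
        return target_lr * (step + 1) / warmup_steps
    return target_lr
```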

Evaluation and Model Registry

Training finished. Now the pipeline has to answer one question: is this model better than what is already serving production traffic? That question is harder than it sounds. You need to compare multiple metrics across multiple data slices, check for regression on specific segments, and sometimes get a human to weigh in.

```mermaid
flowchart TD
    A[Trained Model] --> B[Batch Transform<br/>Evaluation]
    B --> C[Compute Metrics<br/>Across Slices]
    C --> D{Compare to<br/>Baseline Model}
    D -->|All metrics improved| E[Auto-Promote]
    D -->|Mixed results| F[Human Review<br/>Callback]
    D -->|All metrics degraded| G[Reject &<br/>Investigate]
    E --> H[Register in<br/>Model Registry]
    F --> I{Reviewer<br/>Decision}
    I -->|Approve| H
    I -->|Reject| G
    H --> J[Model Package<br/>with Metadata]
    G --> K[Create<br/>Investigation Ticket]
```
*Evaluation decision flow*

Multi-Metric Evaluation

If you are making deployment decisions based on a single accuracy number, stop. I evaluate models on a matrix of metrics and data slices:

| Dimension | Metrics | Failure Threshold |
|---|---|---|
| Overall performance | Accuracy, F1, AUC-ROC | Any metric drops > 2% from baseline |
| Segment performance | Per-segment F1, fairness metrics | Any segment drops > 5% |
| Latency | P50, P95, P99 inference time | P99 exceeds endpoint SLA |
| Model size | Artifact size, memory footprint | Exceeds deployment instance capacity |
| Data coverage | Feature importance stability | Top-10 features differ significantly |

The evaluation step runs as a SageMaker Processing job that produces a JSON report. A Choice state in Step Functions parses this report and routes accordingly: auto-promote, human review, or reject.
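The routing the Choice state applies to that report looks roughly like this in Python. The metric names and report structure are illustrative; the three outcomes mirror the decision flow above:

```python
# Sketch of the three-way routing decision applied to the evaluation
# report. Metric names and the report structure are illustrative.

def route(report):
    """Return 'auto_promote', 'human_review', or 'reject'."""
    deltas = [
        report["candidate"][m] - report["baseline"][m]
        for m in report["baseline"]
    ]
    if all(d > 0 for d in deltas):
        return "auto_promote"   # every metric improved
    if all(d < 0 for d in deltas):
        return "reject"         # every metric degraded
    return "human_review"       # mixed results need a human
```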

Model Registry Integration

When a model passes evaluation, I register it in SageMaker Model Registry with all the metadata you will wish you had six months from now:

  • Model package group: Groups versions of the same model
  • Approval status: PendingManualApproval or Approved
  • Inference specification: Container image, instance types, input/output formats
  • Model metrics: All evaluation metrics attached as metadata
  • Pipeline execution ID: Link back to the Step Functions execution for full lineage
  • Training data hash: SHA-256 of the training dataset for reproducibility
  • Git commit: The commit hash of the training code

Full model lineage. From any production prediction, you can trace back to the exact training data, code version, and pipeline execution that produced the model. This matters when someone asks "why did the model do that?" and you need an answer within the hour.

Error Handling and Retry Strategies

ML pipelines fail in ways that web services do not. Training jobs run for hours. Spot instances vanish mid-epoch. GPU memory errors surface three hours into a run. Data processing jobs fail because someone upstream changed a schema without telling you. Your Step Functions error handling has to deal with all of this gracefully.

Failure Classification

Not all failures deserve the same response. I sort them into four categories:

| Category | Examples | Response | Step Functions Pattern |
|---|---|---|---|
| Transient | Spot interruption, throttling, transient S3 error | Automatic retry with backoff | Retry with BackoffRate |
| Resource | Insufficient capacity, GPU OOM | Retry with different config | Catch → resize → retry state |
| Data | Schema mismatch, corrupt records | Halt and alert, no retry | Catch → alert → fail |
| Logic | NaN loss, divergent training | Retry with different hyperparameters | Catch → adjust → retry state |

The Saga Pattern for Pipeline Rollback

Pipelines fail partway through. You register a model, start the deployment, and something blows up. Now you have partial state sitting around. You need compensating transactions to clean it up. I implement the saga pattern in Step Functions through a parallel error handling branch:

  1. Data preparation fails: No cleanup needed; no state was created
  2. Training fails: Delete partial checkpoints and model artifacts from S3
  3. Evaluation fails: Delete evaluation artifacts, deregister the model if already registered
  4. Deployment fails: Roll back endpoint to previous model version, deregister new model package

Each compensating action is a Lambda function invoked from the Catch handler. Make these idempotent. Running a compensating action twice should not produce additional failures. If it does, you now have two problems instead of one.
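Here is what idempotence looks like for one compensating action, with an in-memory dict standing in for S3 and illustrative key prefixes. Running it a second time is a harmless no-op:

```python
# Idempotent compensating action for a training failure: delete partial
# checkpoints and model artifacts for one execution. The dict stands in
# for S3; bucket layout mirrors the checkpoint paths, names illustrative.

def compensate_training_failure(store, execution_id):
    """Delete partial artifacts for one execution; safe to run twice."""
    prefixes = (
        f"pipeline/{execution_id}/checkpoints/",
        f"pipeline/{execution_id}/model/",
    )
    doomed = [k for k in store if k.startswith(prefixes)]
    for key in doomed:
        store.pop(key, None)  # pop with default: no error if already gone
    return doomed
```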

GPU Out-of-Memory Recovery

GPU OOM errors happen constantly when you are experimenting with batch sizes or model configurations. Your pipeline can recover automatically instead of failing and paging someone at 3 AM:

  1. Catch the AlgorithmError from the training job
  2. Parse the error message for OOM indicators
  3. Reduce the batch size by 50%
  4. Resume from the last checkpoint with the reduced batch size
  5. If batch size falls below a minimum threshold, switch to a larger instance type

This adaptive retry pattern alone has saved my team from countless failed overnight training runs.
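Sketched as a decision function: parse the failure reason for OOM markers, halve the batch size, and climb an instance ladder once the batch size bottoms out. The error strings and the ladder itself are illustrative:

```python
# OOM recovery decision sketch. OOM marker strings and the instance
# ladder are illustrative choices, not an exhaustive list.
OOM_MARKERS = ("CUDA out of memory", "OOM when allocating", "ResourceExhaustedError")
INSTANCE_LADDER = ["ml.g5.2xlarge", "ml.p3.2xlarge", "ml.p3.8xlarge"]

def next_attempt(failure_reason, batch_size, instance, min_batch=8):
    """Return the config for the retry attempt, or None to stop retrying."""
    if not any(marker in failure_reason for marker in OOM_MARKERS):
        return None  # not an OOM: let the normal failure handler take over
    if batch_size // 2 >= min_batch:
        return {"batch_size": batch_size // 2, "instance": instance}
    # Batch size has bottomed out: move up the instance ladder instead.
    idx = INSTANCE_LADDER.index(instance)
    if idx + 1 == len(INSTANCE_LADDER):
        return None  # nowhere left to go: halt and page a human
    return {"batch_size": batch_size, "instance": INSTANCE_LADDER[idx + 1]}
```

A Lambda in the Catch branch runs this logic and feeds the returned config back into the training state.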

Orchestration Patterns

Step Functions gives you several state types that map naturally to ML pipeline needs. The difference between a fragile pipeline and a robust one often comes down to knowing which patterns to combine and when.

Map State for Parallel Training

The Map state does the heavy lifting for parallel training. I use it for hyperparameter sweeps, cross-validation folds, and training multiple model variants at the same time.

```mermaid
flowchart TD
    A[Generate Training<br/>Configurations] --> B

    subgraph B[Map State: Parallel Training]
        direction TB
        C1[Config 1:<br/>lr=0.001, batch=32] --> D1[Training Job 1]
        C2[Config 2:<br/>lr=0.01, batch=64] --> D2[Training Job 2]
        C3[Config 3:<br/>lr=0.001, batch=64] --> D3[Training Job 3]
        C4[Config N:<br/>...] --> D4[Training Job N]
    end

    B --> E[Collect Results]
    E --> F[Select Best Model]
    F --> G[Evaluation Stage]
```
*Parallel training orchestration with Map state*

You feed it an array of training configurations and it launches a separate training job for each. A few configuration decisions matter here:

| Setting | Recommendation | Rationale |
|---|---|---|
| MaxConcurrency | 5-10 | AWS account limits on concurrent training jobs; higher values risk throttling |
| ToleratedFailurePercentage | 20-30% | Some configurations will fail (OOM, divergence); do not halt the entire sweep |
| ItemBatcher | Batch size 1 | Each training job is independent; batching adds no value |
| ResultSelector | Extract metrics only | Full training output is large; select only what the next state needs |
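Generating the configuration array the Map state consumes is just a grid expansion. The hyperparameter names and values here are illustrative:

```python
# Build the Map state's input array: the cartesian product of a
# hyperparameter grid, each entry merged onto shared base settings.
# Hyperparameter names and values are illustrative.
import itertools

def training_configs(grid, base):
    """Cartesian product of the grid, each merged onto base settings."""
    keys = list(grid)
    return [
        {**base, **dict(zip(keys, values))}
        for values in itertools.product(*(grid[k] for k in keys))
    ]
```

The resulting list becomes the Map state's items, one training job per element.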

Nested Workflows for Reusable Sub-Pipelines

Do not build one monolithic state machine. Break it up into nested workflows:

  • Data preparation workflow: Reusable across multiple model training pipelines
  • Training workflow: Parameterized by algorithm, instance type, hyperparameters
  • Evaluation workflow: Reusable evaluation logic with configurable metrics and thresholds
  • Deployment workflow: Handles endpoint creation, traffic shifting, smoke tests

Each one is a separate Step Functions state machine invoked via the StartExecution service integration. The parent passes parameters in and gets the child's output back. You gain independent testing, versioning, and reuse for each piece. When the data team updates the feature engineering logic, they deploy their workflow without touching yours.

Callback Pattern for Human Approval

Any model that touches revenue-critical decisions needs a human to sign off before it goes live. No exceptions. The callback pattern handles this:

  1. The pipeline reaches the approval state and sends a task token to an SNS topic
  2. SNS triggers a notification (email, Slack, PagerDuty)
  3. The pipeline pauses; the Step Functions execution waits (up to 1 year)
  4. A reviewer examines the model metrics, evaluation report, and comparison to baseline
  5. The reviewer calls SendTaskSuccess or SendTaskFailure via an approval UI or CLI
  6. The pipeline resumes based on the reviewer's decision

I build a simple approval UI as a static page hosted on S3. It displays the evaluation report and gives the reviewer Approve/Reject buttons. Those buttons call API Gateway, which triggers a Lambda that calls SendTaskSuccess or SendTaskFailure. The Lambda validates the reviewer's identity against IAM and logs the decision for audit compliance. The whole thing takes an afternoon to build and saves you from the "someone deployed a model via Slack DM" problem.
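The Lambda behind those buttons reduces to a few lines. The Step Functions client is injected here so the logic is testable; in the real Lambda it would be `boto3.client("stepfunctions")`, and the event fields and audit logging are illustrative:

```python
# Sketch of the approval Lambda. The Step Functions client and audit
# log are injected for testability; event field names are illustrative.
import json

def handle_approval(event, sfn_client, audit_log):
    token = event["task_token"]
    reviewer = event["reviewer"]   # identity already validated against IAM upstream
    decision = event["decision"]   # "approve" or "reject"
    audit_log.append({"reviewer": reviewer, "decision": decision})

    if decision == "approve":
        sfn_client.send_task_success(
            taskToken=token,
            output=json.dumps({"approved_by": reviewer}),
        )
    else:
        sfn_client.send_task_failure(
            taskToken=token,
            error="ModelRejected",
            cause=f"Rejected by {reviewer}",
        )
    return {"statusCode": 200}
```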

Cost Optimization

ML training burns money. One ml.p4d.24xlarge instance: $37.69 per hour. A hyperparameter tuning job running 20 training jobs across 10 instances can clear $1,000 in a single run. You have to treat cost optimization as an architectural concern from day one, or the finance team will come knocking.

Spot Instance Savings

| Instance Family | On-Demand $/hr | Typical Spot $/hr | Savings | Interruption Rate |
|---|---|---|---|---|
| ml.m5 (CPU) | $0.23 - $3.69 | $0.07 - $1.11 | ~70% | Low (< 5%) |
| ml.g4dn (T4 GPU) | $0.74 - $7.82 | $0.22 - $2.35 | ~70% | Low (< 5%) |
| ml.g5 (A10G GPU) | $1.52 - $24.32 | $0.46 - $7.30 | ~70% | Moderate (5-15%) |
| ml.p3 (V100 GPU) | $3.83 - $28.15 | $1.15 - $8.45 | ~70% | Moderate (5-15%) |
| ml.p4d (A100 GPU) | $37.69 | $11.31 | ~70% | Higher (10-20%) |

The savings are real. So is the interruption risk, especially on larger GPU instances. The checkpoint and retry strategy from the Training Stage section makes this work. Without it, spot instances are a gamble. With it, they are a legitimate 70% cost reduction.

Managed Warm Pools

SageMaker Managed Warm Pools keep instances provisioned between training jobs. No more 3-5 minute cold starts. This matters most for iterative development and hyperparameter sweeps where you launch job after job after job.

| Scenario | Without Warm Pools | With Warm Pools | Time Saved |
|---|---|---|---|
| 20-job HP sweep (sequential) | 60-100 min startup overhead | ~5 min total | 55-95 min |
| Daily retraining | 3-5 min startup per run | ~0 min | 3-5 min/day |
| Interactive development | 3-5 min per experiment | ~0 min | Significant |

You pay for warm instances while they sit idle, so set KeepAlivePeriodInSeconds to match your actual usage. For hyperparameter sweeps, 3600 seconds (1 hour) works well. For daily retraining? Paying for 23 idle hours to save 4 minutes of startup is a bad trade. Do the math on your own workload.
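That math is a one-line break-even comparison. This sketch values saved startup time at the instance's own hourly rate, which is a simplification; substitute whatever your team's waiting actually costs:

```python
# Warm pool break-even sketch: worth it when the value of startup time
# saved exceeds the cost of the idle keep-alive window. Valuing saved
# time at the instance's hourly rate is a simplifying assumption.

def warm_pool_worth_it(hourly_rate, jobs_per_day, startup_min, idle_hours_per_day):
    saved = hourly_rate * (startup_min / 60) * jobs_per_day
    idle_cost = hourly_rate * idle_hours_per_day
    return saved > idle_cost
```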

Standard vs. Express Workflows

Step Functions has two workflow types. The cost models are completely different:

| Characteristic | Standard Workflow | Express Workflow |
|---|---|---|
| Max duration | 1 year | 5 minutes |
| Pricing | Per state transition ($0.025/1000) | Per request + duration |
| Execution history | 90 days built-in | CloudWatch Logs only |
| Exactly-once execution | Yes | At-least-once |
| Suitable for training pipelines? | Yes: primary pipeline | No: too short for training |
| Suitable for data validation? | Overkill | Yes: fast, cheap |

I use a hybrid approach. The main pipeline runs as a Standard workflow because training jobs are long-running and I need exactly-once semantics. The lightweight data validation sub-workflow runs as an Express workflow nested inside. Fast, cheap, and I do not care if it runs at-least-once because the validation is idempotent anyway. Best of both worlds.

Monitoring and Observability

You need visibility into every stage of the pipeline. When a model starts misbehaving in production, the first question is always "what changed?" Was it the data? The training? The infrastructure? Without observability wired in from the start, you are guessing.

CloudWatch Integration

SageMaker training jobs emit metrics to CloudWatch automatically. Here is what I monitor and alert on:

| Metric | Source | Alert Threshold |
|---|---|---|
| train:loss | Training algorithm | Loss increasing for > 3 epochs |
| validation:accuracy | Training algorithm | Below baseline by > 5% |
| GPUUtilization | SageMaker infrastructure | Below 50% (over-provisioned) or above 95% (throttled) |
| MemoryUtilization | SageMaker infrastructure | Above 90% (OOM risk) |
| DiskUtilization | SageMaker infrastructure | Above 80% (checkpoint storage) |

Step Functions Execution History

Step Functions keeps execution history for 90 days on Standard workflows. For each execution, you get:

  • Input/output for every state: What parameters were passed, what results returned
  • Duration per state: Where the pipeline spends its time
  • Error details: Full error messages and stack traces for failed states
  • Retry attempts: How many retries occurred and why

My CloudWatch dashboards track:

  • Pipeline success rate: Percentage of executions that complete successfully
  • Stage duration trends: Whether training jobs are getting slower over time
  • Cost per execution: Total compute cost tracked via tags
  • Failure categorization: Which failure categories are most common

Custom Metrics Pipeline

The built-in metrics only get you so far. I push custom metrics from each pipeline stage:

| Stage | Custom Metrics |
|---|---|
| Data preparation | Records processed, feature count, null ratio |
| Training | Final loss, best metric, total epochs, early stopping epoch |
| Evaluation | All comparison metrics, pass/fail decision |
| Deployment | Endpoint creation time, smoke test latency |
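Pushing those metrics is a single PutMetricData call per stage. The client is injected for testability (in Lambda it would be `boto3.client("cloudwatch")`); the namespace and dimension names are illustrative:

```python
# Publish per-stage custom metrics via CloudWatch PutMetricData.
# Namespace and dimension names are illustrative choices.

def publish_stage_metrics(cw_client, execution_id, stage, metrics):
    """Push one datapoint per metric, tagged with stage and execution ID."""
    data = [
        {
            "MetricName": name,
            "Value": value,
            "Dimensions": [
                {"Name": "Stage", "Value": stage},
                {"Name": "ExecutionId", "Value": execution_id},
            ],
        }
        for name, value in metrics.items()
    ]
    cw_client.put_metric_data(Namespace="TrainingPipeline", MetricData=data)
    return len(data)
```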

All of this feeds a single CloudWatch dashboard. One screen, full picture of pipeline health. Alarms trigger SNS notifications straight to the ML engineering team's Slack channel. When something breaks at 2 AM, the on-call engineer can see exactly which stage failed and why before they even open the console.

Production Patterns

Everything above comes together in a production pipeline that handles the full lifecycle. Trigger, train, evaluate, deploy, monitor, retrain. Over and over.

Blue/Green Model Deployment

Deploying a new model to a SageMaker endpoint is slow. You cannot just swap models and hope for the best. The blue/green pattern gives you zero-downtime deployment:

  1. Create new endpoint configuration with the new model
  2. Update the endpoint: SageMaker provisions new instances behind the load balancer
  3. Run smoke tests against the new instances
  4. Shift traffic: SageMaker routes traffic to new instances and drains old ones
  5. Monitor for regression: Watch error rates and latency for 15-30 minutes
  6. Complete or rollback: If metrics are stable, decommission old instances; if metrics degrade, shift traffic back

In Step Functions, this maps to a series of states with Wait states for stabilization periods and Choice states for go/no-go decisions. Clean and auditable.

A/B Testing Integration

Sometimes you need to measure business impact before committing to a full deployment. SageMaker endpoint production variants let you run A/B tests:

| Configuration | Use Case | Traffic Split |
|---|---|---|
| Shadow mode | Compare predictions without affecting users | 100% to production, 100% mirrored to candidate |
| Canary deployment | Gradual rollout with metrics monitoring | 95/5 → 90/10 → 50/50 → 100/0 |
| Multi-variant | Test multiple model versions simultaneously | Equal split across variants |

The Step Functions workflow creates the multi-variant endpoint configuration and then waits. Literally. Wait states collect sufficient data before the pipeline makes a promotion decision. Patience built into the architecture.

Automated Retraining Triggers

Models degrade. Data distributions shift. User behavior changes. The pipeline needs to run continuously as part of a feedback loop. Automated retraining triggers close that loop:

| Trigger | Source | Pipeline Action |
|---|---|---|
| Schedule | EventBridge rule (daily/weekly) | Full pipeline execution |
| Data drift | Model Monitor | Pipeline execution with drift flag |
| Performance degradation | CloudWatch alarm | Pipeline execution with urgency flag |
| New data volume | S3 event notification | Pipeline execution after threshold |
| Manual | Console/API/CI-CD | Pipeline execution with custom parameters |

EventBridge rules trigger the Step Functions state machine. The input event includes the trigger type, and the pipeline adjusts its behavior accordingly. A drift-triggered retraining uses a larger dataset window than a scheduled retraining, for instance. The trigger context drives the pipeline configuration without requiring separate pipeline definitions for each scenario.
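The input-mapping step that turns trigger context into pipeline configuration can be as small as this. The window sizes and flag names are illustrative:

```python
# Map the EventBridge trigger context onto pipeline parameters so one
# state machine serves every scenario. Window sizes and flag names are
# illustrative defaults.
DEFAULTS = {"dataset_window_days": 30, "urgent": False}

def pipeline_params(event):
    params = dict(DEFAULTS)
    trigger = event.get("trigger", "schedule")
    if trigger == "drift":
        params["dataset_window_days"] = 90   # retrain on a wider data window
    elif trigger == "degradation":
        params["urgent"] = True              # escalate alerting for this run
    elif trigger == "manual":
        params.update(event.get("overrides", {}))
    params["trigger"] = trigger
    return params
```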

The Complete Production Pipeline

Here is the complete end-to-end production pipeline:

```mermaid
flowchart LR
    A[EventBridge<br/>Trigger] --> B[Data<br/>Preparation]
    B --> C[Feature<br/>Engineering]
    C --> D[Training<br/>Map State]
    D --> E[Model<br/>Selection]
    E --> F[Evaluation<br/>vs Baseline]
    F --> G{Auto-Promote<br/>or Review?}
    G -->|Auto| H[Register<br/>Model]
    G -->|Review| I[Human<br/>Approval]
    I -->|Approved| H
    I -->|Rejected| J[Archive<br/>& Alert]
    H --> K[Blue/Green<br/>Deploy]
    K --> L[Smoke<br/>Tests]
    L --> M{Tests<br/>Pass?}
    M -->|Yes| N[Shift<br/>Traffic]
    M -->|No| O[Rollback<br/>Endpoint]
    N --> P[Model<br/>Monitor]
    P --> Q{Drift or<br/>Degradation?}
    Q -->|Yes| A
    Q -->|No| P
```
*End-to-end production training pipeline*

Standard Step Functions workflow at the top level, nested Express workflows for data validation substeps. End to end, a medium-sized training job takes 2-4 hours from data ingestion to production traffic. Training dominates that time. Everything else finishes in minutes.

Key Production Considerations

Before you deploy any of this, sort out these operational requirements. I have seen teams skip these and regret it within a month:

| Requirement | Implementation |
|---|---|
| IAM least privilege | Separate execution roles for pipeline, training, and deployment |
| VPC networking | Training jobs in private subnets with VPC endpoints for S3, ECR, CloudWatch (see Best Practices for Networking in AWS SageMaker) |
| Artifact encryption | KMS customer-managed keys for S3 model artifacts and EBS volumes |
| Audit logging | CloudTrail for API calls, Step Functions history for pipeline decisions |
| Cost tagging | Consistent tags across all resources: project, pipeline, execution-id |
| Quotas | Request limit increases for concurrent training jobs, endpoints, and processing jobs |


Let's Build Something!

I help teams ship cloud infrastructure that actually works at scale. Whether you're modernizing a legacy platform, designing a multi-region architecture from scratch, or figuring out how AI fits into your engineering workflow, I've seen your problem before. Let me help.

Currently taking on select consulting engagements through Vantalect.