
Building Large-Scale SageMaker Training Pipelines with Step Functions

About the author: I'm Charles Sieg, a cloud architect and platform engineer who builds apps, services, and infrastructure for Fortune 1000 clients through Vantalect. If your organization is rethinking its software strategy in the age of AI-assisted engineering, let's talk.

I have spent the last several months orchestrating ML training pipelines that coordinate dozens of SageMaker jobs: preprocessing, feature engineering, distributed training, hyperparameter tuning, evaluation, conditional deployment. The pattern I keep seeing is that teams pour effort into model architecture and training code while treating the orchestration layer as an afterthought. Then the orchestration layer is exactly where the ugliest production failures happen. This article is my architecture reference for building training pipelines on AWS Step Functions at scale. If you have already read my AWS Step Functions: An Architecture Deep-Dive, the execution model and state types will be familiar. Here we get into the problems specific to ML pipelines: training jobs that run for hours, spot instances that vanish mid-epoch, models that need human sign-off before they touch production traffic, and the retraining loops that keep everything from going stale.

Why Step Functions for ML Pipelines

You have to pick an orchestrator before you build anything else, and this choice will shape every decision downstream. I have built ML pipelines on all the major AWS options. They overlap in feature lists enough to make the decision confusing, but once you operate them in production, the differences become obvious fast.

| Capability | Step Functions | SageMaker Pipelines | MWAA (Airflow) | Custom Lambda |
|---|---|---|---|---|
| Native SageMaker integration | Service integration (optimized) | Native (built-in) | Via boto3 operators | Via SDK calls |
| Visual debugging | Excellent: per-execution graph | Good: pipeline DAG | Good: DAG view | None |
| Max execution duration | 1 year (Standard) | No hard limit | No hard limit | 15 minutes per invocation |
| Branching/conditional logic | Choice states, native | Conditions, limited | Python branching | Custom code |
| Parallel execution | Map state, Parallel state | Parallelism steps | Task parallelism | Fan-out pattern |
| Human approval gates | Callback pattern, native | Not native | Manual sensors | Custom implementation |
| Cost model | Per state transition | Free (pay for jobs) | Hourly environment | Per invocation |
| Error handling | Retry/Catch per state | Retry policies | Task retries | Try/except |
| Nested pipelines | Native (nested executions) | Not supported | SubDAGs/TaskGroups | Recursive invocation |
| Event-driven triggers | EventBridge native | EventBridge/Schedule | External trigger only | EventBridge native |

My recommendation: Use Step Functions when your pipeline has conditional branching, human approval gates, or needs to reach beyond SageMaker into other AWS services. SageMaker Pipelines wins when your workflow is a straightforward DAG touching only SageMaker and you want zero orchestration cost. MWAA makes sense if your team already lives in Airflow or your pipeline spans multiple clouds.

The pipelines in this article are multi-stage, conditional, parallel, and include human approval. Step Functions wins handily.

Pipeline Architecture Overview

A production training pipeline is so much more than "call SageMaker and wait." You are building a multi-stage workflow that takes raw data and turns it into a deployed, monitored model. Every stage has to handle failures cleanly, support partial reruns (because you will need them), and produce auditable artifacts.

```mermaid
flowchart TD
    A[S3 Data Landing] --> B[Data Validation]
    B --> C{Data Quality<br/>Gate}
    C -->|Pass| D[Feature Engineering]
    C -->|Fail| E[Alert & Halt]
    D --> F[Feature Store]
    F --> G[Training Jobs]
    G --> H[Model Evaluation]
    H --> I{Meets<br/>Baseline?}
    I -->|Exceeds| J[Register Model]
    I -->|Below| K[Retrain with<br/>New Hyperparameters]
    I -->|Far Below| L[Alert & Investigate]
    K --> G
    J --> M{Approval<br/>Gate}
    M -->|Approved| N[Deploy to Endpoint]
    M -->|Rejected| L
    N --> O[Smoke Tests]
    O --> P[Production Traffic]
    P --> Q[Model Monitoring]
    Q --> R{Drift<br/>Detected?}
    R -->|Yes| A
    R -->|No| Q
```
*Multi-stage training pipeline architecture*

Each box maps to one or more Step Functions states. Notice the cycle: model monitoring feeds back into data ingestion, forming an automated retraining loop. A single Step Functions execution handles one pass through the pipeline. When monitoring detects drift, a separate EventBridge rule kicks off a fresh execution.

Data Preparation Stage

In my experience, data preparation is where pipelines break most often. Raw data shows up with schema drift, missing values, unexpected cardinality, weird encoding. If you let bad data slide through, you burn GPU hours training on garbage. The data preparation sub-workflow exists to catch problems before they get expensive.

```mermaid
flowchart LR
    A[S3 Raw Data] --> B[Schema<br/>Validation]
    B --> C{Schema<br/>Valid?}
    C -->|Yes| D[SageMaker<br/>Processing Job]
    C -->|No| E[Dead Letter<br/>& Alert]
    D --> F[Feature<br/>Engineering]
    F --> G[Statistical<br/>Validation]
    G --> H{Distribution<br/>Shift?}
    H -->|No| I[Write to<br/>Feature Store]
    H -->|Yes| J[Flag for<br/>Review]
    I --> K[Training<br/>Dataset Ready]
    J --> K
```
*Data preparation sub-workflow*

Schema Validation

The first state runs a lightweight Lambda function that validates incoming data against an expected schema. At a few cents per invocation, it catches the most common data problems before they trigger a Processing job that actually costs money.

I check for:

  • Column presence and types: All expected features present, numeric columns actually numeric
  • Cardinality bounds: Catches the case where a categorical column that should have 50 values suddenly has 50,000
  • Null ratios: Flags when the percentage of nulls in a critical column blows past acceptable thresholds
  • File format and encoding: Confirms Parquet/CSV format with correct encoding
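The checks above can be sketched as a small validation function. The expected-schema structure, column names, and thresholds here are illustrative, not the exact contract my production Lambda enforces:

```python
# Sketch of the schema-validation checks. EXPECTED_SCHEMA's shape,
# column names, and thresholds are illustrative.
EXPECTED_SCHEMA = {
    "columns": {"user_id": int, "region": str, "spend_30d": float},
    "max_cardinality": {"region": 50},
    "max_null_ratio": {"spend_30d": 0.05},
}

def validate_batch(rows, schema=EXPECTED_SCHEMA):
    """Return a list of human-readable violations; an empty list means pass."""
    errors = []

    # Column presence and types: every expected column exists with the right type.
    for col, col_type in schema["columns"].items():
        missing = sum(1 for r in rows if col not in r)
        if missing:
            errors.append(f"{col}: missing from {missing} records")
            continue
        bad_type = sum(
            1 for r in rows if r[col] is not None and not isinstance(r[col], col_type)
        )
        if bad_type:
            errors.append(f"{col}: {bad_type} records with wrong type")

    # Cardinality bounds: catch a categorical column exploding in distinct values.
    for col, bound in schema["max_cardinality"].items():
        distinct = {r.get(col) for r in rows} - {None}
        if len(distinct) > bound:
            errors.append(f"{col}: cardinality {len(distinct)} exceeds bound {bound}")

    # Null ratios: flag critical columns whose null fraction blows past threshold.
    for col, max_ratio in schema["max_null_ratio"].items():
        ratio = sum(1 for r in rows if r.get(col) is None) / max(len(rows), 1)
        if ratio > max_ratio:
            errors.append(f"{col}: null ratio {ratio:.2f} exceeds {max_ratio}")

    return errors
```

The Lambda returns the violation list; a nonempty list routes the execution to the dead-letter branch.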

SageMaker Processing Jobs

I use SageMaker Processing jobs with custom containers for feature engineering. I tried the built-in processors early on and kept hitting walls. Here is why:

| Approach | Pros | Cons |
|---|---|---|
| Built-in SKLearnProcessor | Quick setup, managed container | Limited library versions, constrained customization |
| Built-in SparkProcessor | Distributed processing, large datasets | Heavy startup time (~5 min), overkill for most jobs |
| Custom container | Full control, reproducible, any library | Must maintain container, ECR costs |
| Lambda | Fast startup, cheap | 15-min limit, 10 GB memory limit, no GPU |

Under 50 GB of data? A custom container on ml.m5.4xlarge handles feature engineering in minutes. Above 50 GB, SparkProcessor on a cluster becomes worthwhile despite the startup penalty.

Statistical Validation

After feature engineering, I run a statistical validation step that compares transformed feature distributions against a baseline stored in S3. Schema validation will not catch a feature that remains numeric but whose mean has shifted by three standard deviations. This step will. I store distribution baselines as JSON artifacts and run a Kolmogorov-Smirnov test with configurable significance thresholds. It has saved me from deploying models trained on silently corrupted data more than once.
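Here is a pure-Python sketch of that check: a two-sample Kolmogorov-Smirnov statistic compared against an approximate large-sample critical value. In practice you would reach for scipy.stats.ks_2samp; the significance handling here is simplified for illustration:

```python
# Sketch of the distribution-shift check: two-sample KS statistic
# plus an approximate large-sample critical value. Simplified relative
# to the scipy-based version used in the real pipeline.
import math

def ks_statistic(baseline, current):
    """Max distance between the two empirical CDFs."""
    a, b = sorted(baseline), sorted(current)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        while i < len(a) and a[i] == x:  # advance past ties on both sides
            i += 1
        while j < len(b) and b[j] == x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d

def distribution_shifted(baseline, current, alpha=0.05):
    """True if the KS statistic exceeds the large-sample critical value."""
    n, m = len(baseline), len(current)
    crit = math.sqrt(-0.5 * math.log(alpha / 2)) * math.sqrt((n + m) / (n * m))
    return ks_statistic(baseline, current) > crit
```

A shifted feature trips the flag and routes the execution to the review branch instead of the Feature Store write.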

Training Stage

This is where the real money gets spent. Training eats more compute than every other stage combined, which means cost optimization here has the most leverage by far. Getting this stage right means combining distributed training across multiple instances, spot instances for cost savings, checkpointing for recovery, and optionally hyperparameter tuning.

Instance Selection

Instance selection is a cost-performance tradeoff driven by your model architecture, dataset size, and how long you are willing to wait.

| Instance Type | vCPUs | GPU | GPU Memory | On-Demand $/hr | Spot $/hr (typical) | Best For |
|---|---|---|---|---|---|---|
| ml.m5.xlarge | 4 | None | N/A | $0.23 | $0.07 | Sklearn, XGBoost (small) |
| ml.m5.4xlarge | 16 | None | N/A | $0.92 | $0.28 | XGBoost, feature-heavy tabular |
| ml.g4dn.xlarge | 4 | 1x T4 | 16 GB | $0.74 | $0.22 | Fine-tuning small models, inference testing |
| ml.g5.2xlarge | 8 | 1x A10G | 24 GB | $1.52 | $0.46 | Medium transformer fine-tuning |
| ml.p3.2xlarge | 8 | 1x V100 | 16 GB | $3.83 | $1.15 | Large model training, mixed precision |
| ml.p3.8xlarge | 32 | 4x V100 | 64 GB | $14.69 | $4.41 | Distributed training, large batch |
| ml.p4d.24xlarge | 96 | 8x A100 | 320 GB | $37.69 | $11.31 | Foundation model fine-tuning |
| ml.trn1.32xlarge | 128 | 16x Trainium | 512 GB | $24.78 | $7.43 | Large-scale training (Trainium-compatible) |

My rule of thumb: start with the smallest GPU instance that fits your model in memory, then scale horizontally only when single-instance training blows your time budget. Vertical scaling (bigger instance) is always simpler to operate than horizontal scaling (distributed training). You avoid the inter-node communication overhead entirely. I have watched teams jump straight to multi-node training and spend weeks debugging NCCL timeouts when a single larger instance would have been fine.

Spot Instance Strategy

Spot instances cut training costs by 60-70%. The catch: they can disappear at any moment. I handle this in the Step Functions state machine with a layered retry strategy:

  1. First attempt: Spot instance, checkpointing every 10 minutes
  2. On interruption: Retry on spot with checkpoint resume (most interruptions are transient)
  3. After 3 spot failures: Fall back to on-demand with checkpoint resume
  4. On-demand failure: Alert and halt. Something is fundamentally wrong at this point.

The retry configuration for the training state looks like this conceptually:

```json
{
  "Retry": [
    {
      "ErrorEquals": ["SageMaker.SpotInterruption"],
      "IntervalSeconds": 60,
      "MaxAttempts": 3,
      "BackoffRate": 1.5
    },
    {
      "ErrorEquals": ["SageMaker.ResourceLimitExceeded"],
      "IntervalSeconds": 300,
      "MaxAttempts": 2,
      "BackoffRate": 2.0
    }
  ],
  "Catch": [
    {
      "ErrorEquals": ["States.ALL"],
      "Next": "TrainingFailureHandler"
    }
  ]
}
```

Checkpoint Management

Checkpoints make spot instances viable. Without them, a spot interruption after 6 hours of training means 6 wasted hours. Period. With checkpoints saved every 10 minutes to S3, you lose 10 minutes of work at most. That tradeoff is worth a little S3 storage cost.

I organize checkpoints around three S3 paths:

| Path | Purpose | Lifecycle |
|---|---|---|
| `s3://bucket/pipeline/{execution-id}/checkpoints/` | Active checkpoints for current training run | Deleted after successful training |
| `s3://bucket/pipeline/{execution-id}/model/` | Final model artifacts | Retained per model registry policy |
| `s3://bucket/pipeline/{execution-id}/logs/` | Training logs and metrics | Retained for 90 days |

The Step Functions execution ID flows through every SageMaker job as both a tag and an S3 path component. This gives you full traceability from any pipeline execution to the exact model artifact it produced. When something goes wrong six months later, you will be glad you wired this up.
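Inside the training script, the checkpoint convention is simple: write periodically to a local directory that SageMaker syncs to the checkpoint S3 URI, and on (re)start resume from the newest file present. A minimal sketch, with the file naming and state payload as illustrative choices:

```python
# Checkpoint save/resume convention inside the training script.
# SageMaker syncs the local checkpoint directory to the S3 URI passed
# as checkpoint_s3_uri, so the script only touches local files.
# The /opt/ml/checkpoints path is the conventional default; file
# naming and the state payload here are illustrative.
import json
import os

CHECKPOINT_DIR = "/opt/ml/checkpoints"

def save_checkpoint(epoch, state, ckpt_dir=CHECKPOINT_DIR):
    """Write one checkpoint file; zero-padded names keep sort order correct."""
    os.makedirs(ckpt_dir, exist_ok=True)
    path = os.path.join(ckpt_dir, f"epoch-{epoch:05d}.json")
    with open(path, "w") as f:
        json.dump({"epoch": epoch, "state": state}, f)
    return path

def latest_checkpoint(ckpt_dir=CHECKPOINT_DIR):
    """On (re)start, resume from the newest checkpoint if one exists."""
    if not os.path.isdir(ckpt_dir):
        return None
    names = sorted(n for n in os.listdir(ckpt_dir) if n.startswith("epoch-"))
    if not names:
        return None
    with open(os.path.join(ckpt_dir, names[-1])) as f:
        return json.load(f)
```

A spot-interrupted retry calls `latest_checkpoint()` first and skips straight to the surviving epoch.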

Distributed Training

Sometimes a single instance just cannot finish training in the time you have. That is when you reach for distributed training. SageMaker supports two strategies:

| Strategy | How It Works | Best For | Overhead |
|---|---|---|---|
| Data Parallel | Each instance trains on a data shard, gradients synchronized | Large datasets, models that fit on one GPU | Low: gradient sync only |
| Model Parallel | Model layers split across instances | Models too large for single GPU memory | High: inter-node activations |

For data parallelism, SageMaker's built-in data parallel library handles gradient synchronization automatically. The two things you need to get right are instance count and per-instance batch size. Scale the learning rate linearly with total batch size (the "linear scaling rule") and include a warmup period. Skip the warmup and you will watch your loss explode in the first few steps. I learned this the hard way.
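The linear scaling rule and warmup can be sketched as two small functions. The base values are illustrative, but the relationships are the point: learning rate proportional to total batch size, and a linear ramp over the warmup steps:

```python
# Linear scaling rule plus linear warmup, sketched as a schedule.
# Base values are illustrative.

def scaled_lr(base_lr, base_batch, per_instance_batch, num_instances):
    """Linear scaling rule: lr grows with total effective batch size."""
    total_batch = per_instance_batch * num_instances
    return base_lr * total_batch / base_batch

def lr_at_step(step, warmup_steps, target_lr):
    """Linear warmup from ~0 to target_lr, then constant."""
    if step < warmup_steps:
        return target_lr * (step + 1) / warmup_steps
    return target_lr
```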

Evaluation and Model Registry

Training finished. Now the pipeline has to answer one question: is this model better than what is already serving production traffic? That question is harder than it sounds. You need to compare multiple metrics across multiple data slices, check for regression on specific segments, and sometimes get a human to weigh in.

```mermaid
flowchart TD
    A[Trained Model] --> B[Batch Transform<br/>Evaluation]
    B --> C[Compute Metrics<br/>Across Slices]
    C --> D{Compare to<br/>Baseline Model}
    D -->|All metrics improved| E[Auto-Promote]
    D -->|Mixed results| F[Human Review<br/>Callback]
    D -->|All metrics degraded| G[Reject &<br/>Investigate]
    E --> H[Register in<br/>Model Registry]
    F --> I{Reviewer<br/>Decision}
    I -->|Approve| H
    I -->|Reject| G
    H --> J[Model Package<br/>with Metadata]
    G --> K[Create<br/>Investigation Ticket]
```
*Evaluation decision flow*

Multi-Metric Evaluation

If you are making deployment decisions based on a single accuracy number, stop. I evaluate models on a matrix of metrics and data slices:

| Dimension | Metrics | Failure Threshold |
|---|---|---|
| Overall performance | Accuracy, F1, AUC-ROC | Any metric drops > 2% from baseline |
| Segment performance | Per-segment F1, fairness metrics | Any segment drops > 5% |
| Latency | P50, P95, P99 inference time | P99 exceeds endpoint SLA |
| Model size | Artifact size, memory footprint | Exceeds deployment instance capacity |
| Data coverage | Feature importance stability | Top-10 features differ significantly |

The evaluation step runs as a SageMaker Processing job that produces a JSON report. A Choice state in Step Functions parses this report and routes accordingly: auto-promote, human review, or reject.
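The routing the Choice state applies to that report looks roughly like this in Python. The metric names and report structure are illustrative; the three outcomes mirror the decision flow above:

```python
# Sketch of the three-way routing decision applied to the evaluation
# report. Metric names and the report structure are illustrative.

def route(report):
    """Return 'auto_promote', 'human_review', or 'reject'."""
    deltas = [
        report["candidate"][m] - report["baseline"][m]
        for m in report["baseline"]
    ]
    if all(d > 0 for d in deltas):
        return "auto_promote"   # every metric improved
    if all(d < 0 for d in deltas):
        return "reject"         # every metric degraded
    return "human_review"       # mixed results need a human
```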

Model Registry Integration

When a model passes evaluation, I register it in SageMaker Model Registry with all the metadata you will wish you had six months from now:

  • Model package group: Groups versions of the same model
  • Approval status: PendingManualApproval or Approved
  • Inference specification: Container image, instance types, input/output formats
  • Model metrics: All evaluation metrics attached as metadata
  • Pipeline execution ID: Link back to the Step Functions execution for full lineage
  • Training data hash: SHA-256 of the training dataset for reproducibility
  • Git commit: The commit hash of the training code

Full model lineage. From any production prediction, you can trace back to the exact training data, code version, and pipeline execution that produced the model. This matters when someone asks "why did the model do that?" and you need an answer within the hour.

Error Handling and Retry Strategies

ML pipelines fail in ways that web services do not. Training jobs run for hours. Spot instances vanish mid-epoch. GPU memory errors surface three hours into a run. Data processing jobs fail because someone upstream changed a schema without telling you. Your Step Functions error handling has to deal with all of this gracefully.

Failure Classification

Not all failures deserve the same response. I sort them into four categories:

| Category | Examples | Response | Step Functions Pattern |
|---|---|---|---|
| Transient | Spot interruption, throttling, transient S3 error | Automatic retry with backoff | Retry with BackoffRate |
| Resource | Insufficient capacity, GPU OOM | Retry with different config | Catch → resize → retry state |
| Data | Schema mismatch, corrupt records | Halt and alert, no retry | Catch → alert → fail |
| Logic | NaN loss, divergent training | Retry with different hyperparameters | Catch → adjust → retry state |

The Saga Pattern for Pipeline Rollback

Pipelines fail partway through. You register a model, start the deployment, and something blows up. Now you have partial state sitting around. You need compensating transactions to clean it up. I implement the saga pattern in Step Functions through a parallel error handling branch:

  1. Data preparation fails: No cleanup needed; no state was created
  2. Training fails: Delete partial checkpoints and model artifacts from S3
  3. Evaluation fails: Delete evaluation artifacts, deregister the model if already registered
  4. Deployment fails: Roll back endpoint to previous model version, deregister new model package

Each compensating action is a Lambda function invoked from the Catch handler. Make these idempotent. Running a compensating action twice should not produce additional failures. If it does, you now have two problems instead of one.
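Here is what idempotence looks like for one compensating action, with an in-memory dict standing in for S3 and illustrative key prefixes. Running it a second time is a harmless no-op:

```python
# Idempotent compensating action for a training failure: delete partial
# checkpoints and model artifacts for one execution. The dict stands in
# for S3; bucket layout mirrors the checkpoint paths, names illustrative.

def compensate_training_failure(store, execution_id):
    """Delete partial artifacts for one execution; safe to run twice."""
    prefixes = (
        f"pipeline/{execution_id}/checkpoints/",
        f"pipeline/{execution_id}/model/",
    )
    doomed = [k for k in store if k.startswith(prefixes)]
    for key in doomed:
        store.pop(key, None)  # pop with default: no error if already gone
    return doomed
```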

GPU Out-of-Memory Recovery

GPU OOM errors happen constantly when you are experimenting with batch sizes or model configurations. Your pipeline can recover automatically instead of failing and paging someone at 3 AM:

  1. Catch the AlgorithmError from the training job
  2. Parse the error message for OOM indicators
  3. Reduce the batch size by 50%
  4. Resume from the last checkpoint with the reduced batch size
  5. If batch size falls below a minimum threshold, switch to a larger instance type

This adaptive retry pattern alone has saved my team from countless failed overnight training runs.
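Sketched as a decision function: parse the failure reason for OOM markers, halve the batch size, and climb an instance ladder once the batch size bottoms out. The error strings and the ladder itself are illustrative:

```python
# OOM recovery decision sketch. OOM marker strings and the instance
# ladder are illustrative choices, not an exhaustive list.
OOM_MARKERS = ("CUDA out of memory", "OOM when allocating", "ResourceExhaustedError")
INSTANCE_LADDER = ["ml.g5.2xlarge", "ml.p3.2xlarge", "ml.p3.8xlarge"]

def next_attempt(failure_reason, batch_size, instance, min_batch=8):
    """Return the config for the retry attempt, or None to stop retrying."""
    if not any(marker in failure_reason for marker in OOM_MARKERS):
        return None  # not an OOM: let the normal failure handler take over
    if batch_size // 2 >= min_batch:
        return {"batch_size": batch_size // 2, "instance": instance}
    # Batch size has bottomed out: move up the instance ladder instead.
    idx = INSTANCE_LADDER.index(instance)
    if idx + 1 == len(INSTANCE_LADDER):
        return None  # nowhere left to go: halt and page a human
    return {"batch_size": batch_size, "instance": INSTANCE_LADDER[idx + 1]}
```

A Lambda in the Catch branch runs this logic and feeds the returned config back into the training state.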

Orchestration Patterns

Step Functions gives you several state types that map naturally to ML pipeline needs. The difference between a fragile pipeline and a robust one often comes down to knowing which patterns to combine and when.

Map State for Parallel Training

The Map state does the heavy lifting for parallel training. I use it for hyperparameter sweeps, cross-validation folds, and training multiple model variants at the same time.

```mermaid
flowchart TD
    A[Generate Training<br/>Configurations] --> B

    subgraph B[Map State: Parallel Training]
        direction TB
        C1[Config 1:<br/>lr=0.001, batch=32] --> D1[Training Job 1]
        C2[Config 2:<br/>lr=0.01, batch=64] --> D2[Training Job 2]
        C3[Config 3:<br/>lr=0.001, batch=64] --> D3[Training Job 3]
        C4[Config N:<br/>...] --> D4[Training Job N]
    end

    B --> E[Collect Results]
    E --> F[Select Best Model]
    F --> G[Evaluation Stage]
```
*Parallel training orchestration with Map state*

You feed it an array of training configurations and it launches a separate training job for each. A few configuration decisions matter here:

| Setting | Recommendation | Rationale |
|---|---|---|
| MaxConcurrency | 5-10 | AWS account limits on concurrent training jobs; higher values risk throttling |
| ToleratedFailurePercentage | 20-30% | Some configurations will fail (OOM, divergence); do not halt the entire sweep |
| ItemBatcher | Batch size 1 | Each training job is independent; batching adds no value |
| ResultSelector | Extract metrics only | Full training output is large; select only what the next state needs |
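Generating the configuration array the Map state consumes is just a grid expansion. The hyperparameter names and values here are illustrative:

```python
# Build the Map state's input array: the cartesian product of a
# hyperparameter grid, each entry merged onto shared base settings.
# Hyperparameter names and values are illustrative.
import itertools

def training_configs(grid, base):
    """Cartesian product of the grid, each merged onto base settings."""
    keys = list(grid)
    return [
        {**base, **dict(zip(keys, values))}
        for values in itertools.product(*(grid[k] for k in keys))
    ]
```

The resulting list becomes the Map state's items, one training job per element.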

Nested Workflows for Reusable Sub-Pipelines

Do not build one monolithic state machine. Break it up into nested workflows:

  • Data preparation workflow: Reusable across multiple model training pipelines
  • Training workflow: Parameterized by algorithm, instance type, hyperparameters
  • Evaluation workflow: Reusable evaluation logic with configurable metrics and thresholds
  • Deployment workflow: Handles endpoint creation, traffic shifting, smoke tests

Each one is a separate Step Functions state machine invoked via the StartExecution service integration. The parent passes parameters in and gets the child's output back. You gain independent testing, versioning, and reuse for each piece. When the data team updates the feature engineering logic, they deploy their workflow without touching yours.

Callback Pattern for Human Approval

Any model that touches revenue-critical decisions needs a human to sign off before it goes live. No exceptions. The callback pattern handles this:

  1. The pipeline reaches the approval state and sends a task token to an SNS topic
  2. SNS triggers a notification (email, Slack, PagerDuty)
  3. The pipeline pauses; the Step Functions execution waits (up to 1 year)
  4. A reviewer examines the model metrics, evaluation report, and comparison to baseline
  5. The reviewer calls SendTaskSuccess or SendTaskFailure via an approval UI or CLI
  6. The pipeline resumes based on the reviewer's decision

I build a simple approval UI as a static page hosted on S3. It displays the evaluation report and gives the reviewer Approve/Reject buttons. Those buttons call API Gateway, which triggers a Lambda that calls SendTaskSuccess or SendTaskFailure. The Lambda validates the reviewer's identity against IAM and logs the decision for audit compliance. The whole thing takes an afternoon to build and saves you from the "someone deployed a model via Slack DM" problem.
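The Lambda behind those buttons reduces to a few lines. The Step Functions client is injected here so the logic is testable; in the real Lambda it would be `boto3.client("stepfunctions")`, and the event fields and audit logging are illustrative:

```python
# Sketch of the approval Lambda. The Step Functions client and audit
# log are injected for testability; event field names are illustrative.
import json

def handle_approval(event, sfn_client, audit_log):
    token = event["task_token"]
    reviewer = event["reviewer"]   # identity already validated against IAM upstream
    decision = event["decision"]   # "approve" or "reject"
    audit_log.append({"reviewer": reviewer, "decision": decision})

    if decision == "approve":
        sfn_client.send_task_success(
            taskToken=token,
            output=json.dumps({"approved_by": reviewer}),
        )
    else:
        sfn_client.send_task_failure(
            taskToken=token,
            error="ModelRejected",
            cause=f"Rejected by {reviewer}",
        )
    return {"statusCode": 200}
```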

Cost Optimization

ML training burns money. One ml.p4d.24xlarge instance: $37.69 per hour. A hyperparameter tuning job running 20 training jobs across 10 instances can clear $1,000 in a single run. You have to treat cost optimization as an architectural concern from day one, or the finance team will come knocking.

Spot Instance Savings

| Instance Family | On-Demand $/hr | Typical Spot $/hr | Savings | Interruption Rate |
|---|---|---|---|---|
| ml.m5 (CPU) | $0.23 - $3.69 | $0.07 - $1.11 | ~70% | Low (< 5%) |
| ml.g4dn (T4 GPU) | $0.74 - $7.82 | $0.22 - $2.35 | ~70% | Low (< 5%) |
| ml.g5 (A10G GPU) | $1.52 - $24.32 | $0.46 - $7.30 | ~70% | Moderate (5-15%) |
| ml.p3 (V100 GPU) | $3.83 - $28.15 | $1.15 - $8.45 | ~70% | Moderate (5-15%) |
| ml.p4d (A100 GPU) | $37.69 | $11.31 | ~70% | Higher (10-20%) |

The savings are real. So is the interruption risk, especially on larger GPU instances. The checkpoint and retry strategy from the Training Stage section makes this work. Without it, spot instances are a gamble. With it, they are a legitimate 70% cost reduction.

Managed Warm Pools

SageMaker Managed Warm Pools keep instances provisioned between training jobs. No more 3-5 minute cold starts. This matters most for iterative development and hyperparameter sweeps where you launch job after job after job.

| Scenario | Without Warm Pools | With Warm Pools | Time Saved |
|---|---|---|---|
| 20-job HP sweep (sequential) | 60-100 min startup overhead | ~5 min total | 55-95 min |
| Daily retraining | 3-5 min startup per run | ~0 min | 3-5 min/day |
| Interactive development | 3-5 min per experiment | ~0 min | Significant |

You pay for warm instances while they sit idle, so set KeepAlivePeriodInSeconds to match your actual usage. For hyperparameter sweeps, 3600 seconds (1 hour) works well. For daily retraining? Paying for 23 idle hours to save 4 minutes of startup is a bad trade. Do the math on your own workload.
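That math is a one-line break-even comparison. This sketch values saved startup time at the instance's own hourly rate, which is a simplification; substitute whatever your team's waiting actually costs:

```python
# Warm pool break-even sketch: worth it when the value of startup time
# saved exceeds the cost of the idle keep-alive window. Valuing saved
# time at the instance's hourly rate is a simplifying assumption.

def warm_pool_worth_it(hourly_rate, jobs_per_day, startup_min, idle_hours_per_day):
    saved = hourly_rate * (startup_min / 60) * jobs_per_day
    idle_cost = hourly_rate * idle_hours_per_day
    return saved > idle_cost
```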

Standard vs. Express Workflows

Step Functions has two workflow types. The cost models are completely different:

| Characteristic | Standard Workflow | Express Workflow |
|---|---|---|
| Max duration | 1 year | 5 minutes |
| Pricing | Per state transition ($0.025/1000) | Per request + duration |
| Execution history | 90 days built-in | CloudWatch Logs only |
| Exactly-once execution | Yes | At-least-once |
| Suitable for training pipelines? | Yes: primary pipeline | No: too short for training |
| Suitable for data validation? | Overkill | Yes: fast, cheap |

I use a hybrid approach. The main pipeline runs as a Standard workflow because training jobs are long-running and I need exactly-once semantics. The lightweight data validation sub-workflow runs as an Express workflow nested inside. Fast, cheap, and I do not care if it runs at-least-once because the validation is idempotent anyway. Best of both worlds.

Monitoring and Observability

You need visibility into every stage of the pipeline. When a model starts misbehaving in production, the first question is always "what changed?" Was it the data? The training? The infrastructure? Without observability wired in from the start, you are guessing.

CloudWatch Integration

SageMaker training jobs emit metrics to CloudWatch automatically. Here is what I monitor and alert on:

| Metric | Source | Alert Threshold |
|---|---|---|
| train:loss | Training algorithm | Loss increasing for > 3 epochs |
| validation:accuracy | Training algorithm | Below baseline by > 5% |
| GPUUtilization | SageMaker infrastructure | Below 50% (over-provisioned) or above 95% (throttled) |
| MemoryUtilization | SageMaker infrastructure | Above 90% (OOM risk) |
| DiskUtilization | SageMaker infrastructure | Above 80% (checkpoint storage) |

Step Functions Execution History

Step Functions keeps execution history for 90 days on Standard workflows. For each execution, you get:

  • Input/output for every state: What parameters were passed, what results returned
  • Duration per state: Where the pipeline spends its time
  • Error details: Full error messages and stack traces for failed states
  • Retry attempts: How many retries occurred and why

My CloudWatch dashboards track:

  • Pipeline success rate: Percentage of executions that complete successfully
  • Stage duration trends: Whether training jobs are getting slower over time
  • Cost per execution: Total compute cost tracked via tags
  • Failure categorization: Which failure categories are most common

Custom Metrics Pipeline

The built-in metrics only get you so far. I push custom metrics from each pipeline stage:

| Stage | Custom Metrics |
|---|---|
| Data preparation | Records processed, feature count, null ratio |
| Training | Final loss, best metric, total epochs, early stopping epoch |
| Evaluation | All comparison metrics, pass/fail decision |
| Deployment | Endpoint creation time, smoke test latency |
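Pushing those metrics is a single PutMetricData call per stage. The client is injected for testability (in Lambda it would be `boto3.client("cloudwatch")`); the namespace and dimension names are illustrative:

```python
# Publish per-stage custom metrics via CloudWatch PutMetricData.
# Namespace and dimension names are illustrative choices.

def publish_stage_metrics(cw_client, execution_id, stage, metrics):
    """Push one datapoint per metric, tagged with stage and execution ID."""
    data = [
        {
            "MetricName": name,
            "Value": value,
            "Dimensions": [
                {"Name": "Stage", "Value": stage},
                {"Name": "ExecutionId", "Value": execution_id},
            ],
        }
        for name, value in metrics.items()
    ]
    cw_client.put_metric_data(Namespace="TrainingPipeline", MetricData=data)
    return len(data)
```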

All of this feeds a single CloudWatch dashboard. One screen, full picture of pipeline health. Alarms trigger SNS notifications straight to the ML engineering team's Slack channel. When something breaks at 2 AM, the on-call engineer can see exactly which stage failed and why before they even open the console.

Production Patterns

Everything above comes together in a production pipeline that handles the full lifecycle. Trigger, train, evaluate, deploy, monitor, retrain. Over and over.

Blue/Green Model Deployment

Deploying a new model to a SageMaker endpoint is slow. You cannot just swap models and hope for the best. The blue/green pattern gives you zero-downtime deployment:

  1. Create new endpoint configuration with the new model
  2. Update the endpoint: SageMaker provisions new instances behind the load balancer
  3. Run smoke tests against the new instances
  4. Shift traffic: SageMaker routes traffic to new instances and drains old ones
  5. Monitor for regression: Watch error rates and latency for 15-30 minutes
  6. Complete or rollback: If metrics are stable, decommission old instances; if metrics degrade, shift traffic back

In Step Functions, this maps to a series of states with Wait states for stabilization periods and Choice states for go/no-go decisions. Clean and auditable.

A/B Testing Integration

Sometimes you need to measure business impact before committing to a full deployment. SageMaker endpoint production variants let you run A/B tests:

| Configuration | Use Case | Traffic Split |
|---|---|---|
| Shadow mode | Compare predictions without affecting users | 100% to production, 100% mirrored to candidate |
| Canary deployment | Gradual rollout with metrics monitoring | 95/5 → 90/10 → 50/50 → 100/0 |
| Multi-variant | Test multiple model versions simultaneously | Equal split across variants |

The Step Functions workflow creates the multi-variant endpoint configuration and then waits. Literally. Wait states collect sufficient data before the pipeline makes a promotion decision. Patience built into the architecture.

Automated Retraining Triggers

Models degrade. Data distributions shift. User behavior changes. The pipeline needs to run continuously as part of a feedback loop. Automated retraining triggers close that loop:

| Trigger | Source | Pipeline Action |
|---|---|---|
| Schedule | EventBridge rule (daily/weekly) | Full pipeline execution |
| Data drift | Model Monitor | Pipeline execution with drift flag |
| Performance degradation | CloudWatch alarm | Pipeline execution with urgency flag |
| New data volume | S3 event notification | Pipeline execution after threshold |
| Manual | Console/API/CI-CD | Pipeline execution with custom parameters |

EventBridge rules trigger the Step Functions state machine. The input event includes the trigger type, and the pipeline adjusts its behavior accordingly. A drift-triggered retraining uses a larger dataset window than a scheduled retraining, for instance. The trigger context drives the pipeline configuration without requiring separate pipeline definitions for each scenario.
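The input-mapping step that turns trigger context into pipeline configuration can be as small as this. The window sizes and flag names are illustrative:

```python
# Map the EventBridge trigger context onto pipeline parameters so one
# state machine serves every scenario. Window sizes and flag names are
# illustrative defaults.
DEFAULTS = {"dataset_window_days": 30, "urgent": False}

def pipeline_params(event):
    params = dict(DEFAULTS)
    trigger = event.get("trigger", "schedule")
    if trigger == "drift":
        params["dataset_window_days"] = 90   # retrain on a wider data window
    elif trigger == "degradation":
        params["urgent"] = True              # escalate alerting for this run
    elif trigger == "manual":
        params.update(event.get("overrides", {}))
    params["trigger"] = trigger
    return params
```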

The Complete Production Pipeline

Here is the complete end-to-end production pipeline:

```mermaid
flowchart LR
    A[EventBridge<br/>Trigger] --> B[Data<br/>Preparation]
    B --> C[Feature<br/>Engineering]
    C --> D[Training<br/>Map State]
    D --> E[Model<br/>Selection]
    E --> F[Evaluation<br/>vs Baseline]
    F --> G{Auto-Promote<br/>or Review?}
    G -->|Auto| H[Register<br/>Model]
    G -->|Review| I[Human<br/>Approval]
    I -->|Approved| H
    I -->|Rejected| J[Archive<br/>& Alert]
    H --> K[Blue/Green<br/>Deploy]
    K --> L[Smoke<br/>Tests]
    L --> M{Tests<br/>Pass?}
    M -->|Yes| N[Shift<br/>Traffic]
    M -->|No| O[Rollback<br/>Endpoint]
    N --> P[Model<br/>Monitor]
    P --> Q{Drift or<br/>Degradation?}
    Q -->|Yes| A
    Q -->|No| P
```
*End-to-end production training pipeline*

Standard Step Functions workflow at the top level, nested Express workflows for data validation substeps. End to end, a medium-sized training job takes 2-4 hours from data ingestion to production traffic. Training dominates that time. Everything else finishes in minutes.

Key Production Considerations

Before you deploy any of this, sort out these operational requirements. I have seen teams skip these and regret it within a month:

| Requirement | Implementation |
|---|---|
| IAM least privilege | Separate execution roles for pipeline, training, and deployment |
| VPC networking | Training jobs in private subnets with VPC endpoints for S3, ECR, CloudWatch (see Best Practices for Networking in AWS SageMaker) |
| Artifact encryption | KMS customer-managed keys for S3 model artifacts and EBS volumes |
| Audit logging | CloudTrail for API calls, Step Functions history for pipeline decisions |
| Cost tagging | Consistent tags across all resources: project, pipeline, execution-id |
| Quotas | Request limit increases for concurrent training jobs, endpoints, and processing jobs |


Let's Build Something!

I help teams ship cloud infrastructure that actually works at scale. Whether you're modernizing a legacy platform, designing a multi-region architecture from scratch, or figuring out how AI fits into your engineering workflow, I've seen your problem before. Let me help.

Currently taking on select consulting engagements through Vantalect.