About the author: I'm Charles Sieg, a cloud architect and platform engineer who builds apps, services, and infrastructure for Fortune 1000 clients through Vantalect. If your organization is rethinking its software strategy in the age of AI-assisted engineering, let's talk.
I have spent the last several months orchestrating ML training pipelines that coordinate dozens of SageMaker jobs: preprocessing, feature engineering, distributed training, hyperparameter tuning, evaluation, conditional deployment. The pattern I keep seeing is that teams pour effort into model architecture and training code while treating the orchestration layer as an afterthought. Then the orchestration layer is exactly where the ugliest production failures happen. This article is my architecture reference for building training pipelines on AWS Step Functions at scale. If you have already read my AWS Step Functions: An Architecture Deep-Dive, the execution model and state types will be familiar. Here we get into the problems specific to ML pipelines: training jobs that run for hours, spot instances that vanish mid-epoch, models that need human sign-off before they touch production traffic, and the retraining loops that keep everything from going stale.
Why Step Functions for ML Pipelines
You have to pick an orchestrator before you build anything else, and this choice will shape every decision downstream. I have built ML pipelines on all the major AWS options. They overlap in feature lists enough to make the decision confusing, but once you operate them in production, the differences become obvious fast.
| Capability | Step Functions | SageMaker Pipelines | MWAA (Airflow) | Custom Lambda |
|---|---|---|---|---|
| Native SageMaker integration | Service integration (optimized) | Native (built-in) | Via boto3 operators | Via SDK calls |
| Visual debugging | Excellent: per-execution graph | Good: pipeline DAG | Good: DAG view | None |
| Max execution duration | 1 year (Standard) | No hard limit | No hard limit | 15 minutes per invocation |
| Branching/conditional logic | Choice states, native | Conditions, limited | Python branching | Custom code |
| Parallel execution | Map state, Parallel state | Parallelism steps | Task parallelism | Fan-out pattern |
| Human approval gates | Callback pattern, native | Not native | Manual sensors | Custom implementation |
| Cost model | Per state transition | Free (pay for jobs) | Hourly environment | Per invocation |
| Error handling | Retry/Catch per state | Retry policies | Task retries | Try/except |
| Nested pipelines | Native (nested executions) | Not supported | SubDAGs/TaskGroups | Recursive invocation |
| Event-driven triggers | EventBridge native | EventBridge/Schedule | External trigger only | EventBridge native |
My recommendation: Use Step Functions when your pipeline has conditional branching, human approval gates, or needs to reach beyond SageMaker into other AWS services. SageMaker Pipelines wins when your workflow is a straightforward DAG touching only SageMaker and you want zero orchestration cost. MWAA makes sense if your team already lives in Airflow or your pipeline spans multiple clouds.
The pipelines in this article are multi-stage, conditional, parallel, and include human approval. Step Functions wins handily.
Pipeline Architecture Overview
A production training pipeline is so much more than "call SageMaker and wait." You are building a multi-stage workflow that takes raw data and turns it into a deployed, monitored model. Every stage has to handle failures cleanly, support partial reruns (because you will need them), and produce auditable artifacts.
```mermaid
flowchart TD
    A[S3 Data Landing] --> B[Data Validation]
    B --> C{Data Quality Gate}
    C -->|Pass| D[Feature Engineering]
    C -->|Fail| E[Alert & Halt]
    D --> F[Feature Store]
    F --> G[Training Jobs]
    G --> H[Model Evaluation]
    H --> I{Meets Baseline?}
    I -->|Exceeds| J[Register Model]
    I -->|Below| K[Retrain with New Hyperparameters]
    I -->|Far Below| L[Alert & Investigate]
    K --> G
    J --> M{Approval Gate}
    M -->|Approved| N[Deploy to Endpoint]
    M -->|Rejected| L
    N --> O[Smoke Tests]
    O --> P[Production Traffic]
    P --> Q[Model Monitoring]
    Q --> R{Drift Detected?}
    R -->|Yes| A
    R -->|No| Q
```

Each box maps to one or more Step Functions states. Notice the cycle: model monitoring feeds back into data ingestion, forming an automated retraining loop. A single Step Functions execution handles one pass through the pipeline. When monitoring detects drift, a separate EventBridge rule kicks off a fresh execution.
Data Preparation Stage
In my experience, data preparation is where pipelines break most often. Raw data shows up with schema drift, missing values, unexpected cardinality, weird encoding. If you let bad data slide through, you burn GPU hours training on garbage. The data preparation sub-workflow exists to catch problems before they get expensive.
```mermaid
flowchart LR
    A[S3 Raw Data] --> B[Schema Validation]
    B --> C{Schema Valid?}
    C -->|Yes| D[SageMaker Processing Job]
    C -->|No| E[Dead Letter & Alert]
    D --> F[Feature Engineering]
    F --> G[Statistical Validation]
    G --> H{Distribution Shift?}
    H -->|No| I[Write to Feature Store]
    H -->|Yes| J[Flag for Review]
    I --> K[Training Dataset Ready]
    J --> K
```

Schema Validation
The first state runs a lightweight Lambda function that validates incoming data against an expected schema. A few cents per invocation. It catches the most common data problems before they trigger a Processing job that actually costs money.
I check for:
- Column presence and types: All expected features present, numeric columns actually numeric
- Cardinality bounds: Catches the case where a categorical column that should have 50 values suddenly has 50,000
- Null ratios: Flags when the percentage of nulls in a critical column blows past acceptable thresholds
- File format and encoding: Confirms Parquet/CSV format with correct encoding
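A minimal sketch of those checks in Python. The schema dict, thresholds, and record format here are illustrative, not the exact production handler:

```python
def validate_schema(rows, expected_types, max_cardinality=1000, max_null_ratio=0.05):
    """Validate a batch of records against an expected schema.
    Returns a list of human-readable violations (empty list = pass)."""
    if not rows:
        return ["no records received"]
    violations = []
    for col, expected in expected_types.items():
        # Column presence
        if col not in rows[0]:
            violations.append(f"missing column: {col}")
            continue
        non_null = [r[col] for r in rows if r[col] is not None]
        # Type check on non-null values
        if any(not isinstance(v, expected) for v in non_null):
            violations.append(f"type mismatch in {col}: expected {expected.__name__}")
        # Null-ratio threshold
        null_ratio = 1 - len(non_null) / len(rows)
        if null_ratio > max_null_ratio:
            violations.append(f"null ratio {null_ratio:.0%} in {col} exceeds threshold")
        # Cardinality bound for categorical (string) columns
        if expected is str and len(set(non_null)) > max_cardinality:
            violations.append(f"cardinality of {col} exceeds {max_cardinality}")
    return violations
```

If the returned list is non-empty, the Lambda raises, and the Catch handler routes to the dead-letter path instead of launching a Processing job.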
SageMaker Processing Jobs
I use SageMaker Processing jobs with custom containers for feature engineering. I tried the built-in processors early on and kept hitting walls. Here is why:
| Approach | Pros | Cons |
|---|---|---|
| Built-in SKLearnProcessor | Quick setup, managed container | Limited library versions, constrained customization |
| Built-in SparkProcessor | Distributed processing, large datasets | Heavy startup time (~5 min), overkill for most jobs |
| Custom container | Full control, reproducible, any library | Must maintain container, ECR costs |
| Lambda | Fast startup, cheap | 15-min limit, 10 GB memory limit, no GPU |
Under 50 GB of data? A custom container on ml.m5.4xlarge handles feature engineering in minutes. Above 50 GB, SparkProcessor on a cluster becomes worthwhile despite the startup penalty.
Statistical Validation
After feature engineering, I run a statistical validation step that compares transformed feature distributions against a baseline stored in S3. Schema validation will not catch a feature that remains numeric but whose mean has shifted by three standard deviations. This step will. I store distribution baselines as JSON artifacts and run a Kolmogorov-Smirnov test with configurable significance thresholds. It has saved me from deploying models trained on silently corrupted data more than once.
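In production I use `scipy.stats.ks_2samp` for the p-value; the core idea fits in a few lines of pure Python, here compared against a fixed statistic threshold rather than a significance level (the threshold value is illustrative):

```python
import bisect

def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest vertical
    gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    max_gap = 0.0
    for x in sorted(set(a) | set(b)):
        cdf_a = bisect.bisect_right(a, x) / len(a)
        cdf_b = bisect.bisect_right(b, x) / len(b)
        max_gap = max(max_gap, abs(cdf_a - cdf_b))
    return max_gap

def distribution_shifted(baseline, current, threshold=0.2):
    """Flag a feature whose current distribution has drifted from the
    stored baseline. threshold is a sensitivity knob, not a p-value."""
    return ks_statistic(baseline, current) > threshold
```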
Training Stage
This is where the real money gets spent. Training eats more compute than every other stage combined, which means cost optimization here has the most leverage by far. Getting this stage right means handling distributed training across multiple instances, spot instances for cost savings, checkpointing for recovery, and optionally hyperparameter tuning.
Instance Selection
Instance selection is a cost-performance tradeoff driven by your model architecture, dataset size, and how long you are willing to wait.
| Instance Type | vCPUs | GPU | GPU Memory | On-Demand $/hr | Spot $/hr (typical) | Best For |
|---|---|---|---|---|---|---|
| ml.m5.xlarge | 4 | None | N/A | $0.23 | $0.07 | Sklearn, XGBoost (small) |
| ml.m5.4xlarge | 16 | None | N/A | $0.92 | $0.28 | XGBoost, feature-heavy tabular |
| ml.g4dn.xlarge | 4 | 1x T4 | 16 GB | $0.74 | $0.22 | Fine-tuning small models, inference testing |
| ml.g5.2xlarge | 8 | 1x A10G | 24 GB | $1.52 | $0.46 | Medium transformer fine-tuning |
| ml.p3.2xlarge | 8 | 1x V100 | 16 GB | $3.83 | $1.15 | Large model training, mixed precision |
| ml.p3.8xlarge | 32 | 4x V100 | 64 GB | $14.69 | $4.41 | Distributed training, large batch |
| ml.p4d.24xlarge | 96 | 8x A100 | 320 GB | $37.69 | $11.31 | Foundation model fine-tuning |
| ml.trn1.32xlarge | 128 | 16x Trainium | 512 GB | $24.78 | $7.43 | Large-scale training (Trainium-compatible) |
My rule of thumb: start with the smallest GPU instance that fits your model in memory, then scale horizontally only when single-instance training blows your time budget. Vertical scaling (bigger instance) is always simpler to operate than horizontal scaling (distributed training). You avoid the inter-node communication overhead entirely. I have watched teams jump straight to multi-node training and spend weeks debugging NCCL timeouts when a single larger instance would have been fine.
Spot Instance Strategy
Spot instances cut training costs by 60-70%. The catch: they can disappear at any moment. I handle this in the Step Functions state machine with a layered retry strategy:
- First attempt: Spot instance, checkpointing every 10 minutes
- On interruption: Retry on spot with checkpoint resume (most interruptions are transient)
- After 3 spot failures: Fall back to on-demand with checkpoint resume
- On-demand failure: Alert and halt. Something is fundamentally wrong at this point.
The retry configuration for the training state looks like this conceptually:
```json
{
  "Retry": [
    {
      "ErrorEquals": ["SageMaker.SpotInterruption"],
      "IntervalSeconds": 60,
      "MaxAttempts": 3,
      "BackoffRate": 1.5
    },
    {
      "ErrorEquals": ["SageMaker.ResourceLimitExceeded"],
      "IntervalSeconds": 300,
      "MaxAttempts": 2,
      "BackoffRate": 2.0
    }
  ],
  "Catch": [
    {
      "ErrorEquals": ["States.ALL"],
      "Next": "TrainingFailureHandler"
    }
  ]
}
```
Checkpoint Management
Checkpoints make spot instances viable. Without them, a spot interruption after 6 hours of training means 6 wasted hours. Period. With checkpoints saved every 10 minutes to S3, you lose 10 minutes of work at most. That tradeoff is worth a little S3 storage cost.
I organize checkpoints around three S3 paths:
| Path | Purpose | Lifecycle |
|---|---|---|
| s3://bucket/pipeline/{execution-id}/checkpoints/ | Active checkpoints for current training run | Deleted after successful training |
| s3://bucket/pipeline/{execution-id}/model/ | Final model artifacts | Retained per model registry policy |
| s3://bucket/pipeline/{execution-id}/logs/ | Training logs and metrics | Retained for 90 days |
The Step Functions execution ID flows through every SageMaker job as both a tag and an S3 path component. This gives you full traceability from any pipeline execution to the exact model artifact it produced. When something goes wrong six months later, you will be glad you wired this up.
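A small helper keeps the layout and tagging consistent across every job the pipeline launches (bucket name and tag key are hypothetical):

```python
def pipeline_paths(bucket, execution_id):
    """S3 prefixes for one pipeline execution, following the layout above."""
    base = f"s3://{bucket}/pipeline/{execution_id}"
    return {
        "checkpoints": f"{base}/checkpoints/",  # deleted after success
        "model": f"{base}/model/",              # retained per registry policy
        "logs": f"{base}/logs/",                # retained 90 days
    }

def job_tags(execution_id):
    """Tags attached to every SageMaker job for execution-level traceability."""
    return [{"Key": "pipeline-execution-id", "Value": execution_id}]
```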
Distributed Training
Sometimes a single instance just cannot finish training in the time you have. That is when you reach for distributed training. SageMaker supports two strategies:
| Strategy | How It Works | Best For | Overhead |
|---|---|---|---|
| Data Parallel | Each instance trains on a data shard, gradients synchronized | Large datasets, models that fit on one GPU | Low: gradient sync only |
| Model Parallel | Model layers split across instances | Models too large for single GPU memory | High: inter-node activations |
For data parallelism, SageMaker's built-in data parallel library handles gradient synchronization automatically. The two things you need to get right are instance count and per-instance batch size. Scale the learning rate linearly with total batch size (the "linear scaling rule") and include a warmup period. Skip the warmup and you will watch your loss explode in the first few steps. I learned this the hard way.
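The scaling rule plus warmup can be sketched as a single function (the warmup length and base values are illustrative):

```python
def learning_rate(base_lr, base_batch, instances, per_instance_batch,
                  step, warmup_steps=500):
    """Linear scaling rule with linear warmup.

    base_lr was tuned at base_batch; the target LR grows with the total
    effective batch size and ramps in linearly over warmup_steps."""
    total_batch = instances * per_instance_batch
    target = base_lr * total_batch / base_batch
    if step < warmup_steps:
        return target * (step + 1) / warmup_steps  # avoid an LR of exactly 0
    return target
```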
Evaluation and Model Registry
Training finished. Now the pipeline has to answer one question: is this model better than what is already serving production traffic? That question is harder than it sounds. You need to compare multiple metrics across multiple data slices, check for regression on specific segments, and sometimes get a human to weigh in.
```mermaid
flowchart TD
    A[Trained Model] --> B[Batch Transform Evaluation]
    B --> C[Compute Metrics Across Slices]
    C --> D{Compare to Baseline Model}
    D -->|All metrics improved| E[Auto-Promote]
    D -->|Mixed results| F[Human Review Callback]
    D -->|All metrics degraded| G[Reject & Investigate]
    E --> H[Register in Model Registry]
    F --> I{Reviewer Decision}
    I -->|Approve| H
    I -->|Reject| G
    H --> J[Model Package with Metadata]
    G --> K[Create Investigation Ticket]
```

Multi-Metric Evaluation
If you are making deployment decisions based on a single accuracy number, stop. I evaluate models on a matrix of metrics and data slices:
| Dimension | Metrics | Failure Threshold |
|---|---|---|
| Overall performance | Accuracy, F1, AUC-ROC | Any metric drops > 2% from baseline |
| Segment performance | Per-segment F1, fairness metrics | Any segment drops > 5% |
| Latency | P50, P95, P99 inference time | P99 exceeds endpoint SLA |
| Model size | Artifact size, memory footprint | Exceeds deployment instance capacity |
| Data coverage | Feature importance stability | Top-10 features differ significantly |
The evaluation step runs as a SageMaker Processing job that produces a JSON report. A Choice state in Step Functions parses this report and routes accordingly: auto-promote, human review, or reject.
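A simplified version of the routing decision the Choice state implements. The real report carries per-slice thresholds from the table above; this sketch collapses them into a single review band:

```python
def route_evaluation(candidate, baseline, review_band=0.02):
    """Route based on metric deltas vs. the baseline model.

    All metrics improved -> auto-promote; all clearly degraded -> reject;
    anything mixed (or inside the review band) -> human review."""
    deltas = [candidate[m] - baseline[m] for m in baseline]
    if all(d > 0 for d in deltas):
        return "auto-promote"
    if all(d < -review_band for d in deltas):
        return "reject"
    return "human-review"
```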
Model Registry Integration
When a model passes evaluation, I register it in SageMaker Model Registry with all the metadata you will wish you had six months from now:
- Model package group: Groups versions of the same model
- Approval status: PendingManualApproval or Approved
- Inference specification: Container image, instance types, input/output formats
- Model metrics: All evaluation metrics attached as metadata
- Pipeline execution ID: Link back to the Step Functions execution for full lineage
- Training data hash: SHA-256 of the training dataset for reproducibility
- Git commit: The commit hash of the training code
Full model lineage. From any production prediction, you can trace back to the exact training data, code version, and pipeline execution that produced the model. This matters when someone asks "why did the model do that?" and you need an answer within the hour.
Error Handling and Retry Strategies
ML pipelines fail in ways that web services do not. Training jobs run for hours. Spot instances vanish mid-epoch. GPU memory errors surface three hours into a run. Data processing jobs fail because someone upstream changed a schema without telling you. Your Step Functions error handling has to deal with all of this gracefully.
Failure Classification
Not all failures deserve the same response. I sort them into four categories:
| Category | Examples | Response | Step Functions Pattern |
|---|---|---|---|
| Transient | Spot interruption, throttling, transient S3 error | Automatic retry with backoff | Retry with BackoffRate |
| Resource | Insufficient capacity, GPU OOM | Retry with different config | Catch → resize → retry state |
| Data | Schema mismatch, corrupt records | Halt and alert, no retry | Catch → alert → fail |
| Logic | NaN loss, divergent training | Retry with different hyperparameters | Catch → adjust → retry state |
The Saga Pattern for Pipeline Rollback
Pipelines fail partway through. You register a model, start the deployment, and something blows up. Now you have partial state sitting around. You need compensating transactions to clean it up. I implement the saga pattern in Step Functions through a parallel error handling branch:
- Data preparation fails: No cleanup needed; no state was created
- Training fails: Delete partial checkpoints and model artifacts from S3
- Evaluation fails: Delete evaluation artifacts, deregister the model if already registered
- Deployment fails: Roll back endpoint to previous model version, deregister new model package
Each compensating action is a Lambda function invoked from the Catch handler. Make these idempotent. Running a compensating action twice should not produce additional failures. If it does, you now have two problems instead of one.
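The training-failure compensator, sketched with an injected dict standing in for S3 (in production this is `list_objects_v2` plus `delete_objects` against the bucket):

```python
def compensate_training_failure(store, execution_id):
    """Compensating action for a failed training stage: delete partial
    checkpoints and model artifacts.

    Idempotent by construction: deleting keys that are already gone is a
    no-op, so re-running this handler is always safe."""
    prefixes = (f"pipeline/{execution_id}/checkpoints/",
                f"pipeline/{execution_id}/model/")
    deleted = 0
    for key in list(store):
        if key.startswith(prefixes):
            del store[key]
            deleted += 1
    return deleted
```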
GPU Out-of-Memory Recovery
GPU OOM errors happen constantly when you are experimenting with batch sizes or model configurations. Your pipeline can recover automatically instead of failing and paging someone at 3 AM:
- Catch the AlgorithmError from the training job
- Parse the error message for OOM indicators
- Reduce the batch size by 50%
- Resume from the last checkpoint with the reduced batch size
- If batch size falls below a minimum threshold, switch to a larger instance type
This adaptive retry pattern alone has saved my team from countless failed overnight training runs.
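The decision logic behind those steps, sketched as the function that builds the retry configuration. The OOM markers and fallback instance type are assumptions for illustration:

```python
def next_training_config(error_message, config,
                         min_batch=8, larger_instance="ml.g5.2xlarge"):
    """Decide the retry configuration after a training failure.

    Halve the batch size on OOM; once that would drop below min_batch,
    move to a larger instance instead. Returns None for non-OOM errors
    so the normal Retry/Catch chain handles them."""
    oom_markers = ("CUDA out of memory", "OOM", "ResourceExhausted")
    if not any(m in error_message for m in oom_markers):
        return None
    new = dict(config)
    if config["batch_size"] // 2 >= min_batch:
        new["batch_size"] = config["batch_size"] // 2
    else:
        new["instance_type"] = larger_instance
    new["resume_from_checkpoint"] = True
    return new
```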
Orchestration Patterns
Step Functions gives you several state types that map naturally to ML pipeline needs. The difference between a fragile pipeline and a robust one often comes down to knowing which patterns to combine and when.
Map State for Parallel Training
The Map state does the heavy lifting for parallel training. I use it for hyperparameter sweeps, cross-validation folds, and training multiple model variants at the same time.
```mermaid
flowchart TD
    A[Generate Training Configurations] --> B
    subgraph B[Map State: Parallel Training]
        direction TB
        C1[Config 1: lr=0.001, batch=32] --> D1[Training Job 1]
        C2[Config 2: lr=0.01, batch=64] --> D2[Training Job 2]
        C3[Config 3: lr=0.001, batch=64] --> D3[Training Job 3]
        C4[Config N: ...] --> D4[Training Job N]
    end
    B --> E[Collect Results]
    E --> F[Select Best Model]
    F --> G[Evaluation Stage]
```

You feed it an array of training configurations and it launches a separate training job for each. A few configuration decisions matter here:
| Setting | Recommendation | Rationale |
|---|---|---|
| MaxConcurrency | 5-10 | AWS account limits on concurrent training jobs; higher values risk throttling |
| ToleratedFailurePercentage | 20-30% | Some configurations will fail (OOM, divergence); do not halt the entire sweep |
| ItemBatcher | Batch size 1 | Each training job is independent; batching adds no value |
| ResultSelector | Extract metrics only | Full training output is large; select only what the next state needs |
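Generating the Map state's input array from a hyperparameter grid is a one-liner away from `itertools.product` (field names are illustrative):

```python
from itertools import product

def training_configs(base_job_name, grid):
    """Expand a hyperparameter grid into the item array a Map state
    iterates over. Each item carries everything one training job needs."""
    keys = sorted(grid)
    configs = []
    for i, values in enumerate(product(*(grid[k] for k in keys))):
        item = dict(zip(keys, values))
        item["job_name"] = f"{base_job_name}-{i}"
        configs.append(item)
    return configs
```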
Nested Workflows for Reusable Sub-Pipelines
Do not build one monolithic state machine. Break it up into nested workflows:
- Data preparation workflow: Reusable across multiple model training pipelines
- Training workflow: Parameterized by algorithm, instance type, hyperparameters
- Evaluation workflow: Reusable evaluation logic with configurable metrics and thresholds
- Deployment workflow: Handles endpoint creation, traffic shifting, smoke tests
Each one is a separate Step Functions state machine invoked via the StartExecution service integration. The parent passes parameters in and gets the child's output back. You gain independent testing, versioning, and reuse for each piece. When the data team updates the feature engineering logic, they deploy their workflow without touching yours.
Callback Pattern for Human Approval
Any model that touches revenue-critical decisions needs a human to sign off before it goes live. No exceptions. The callback pattern handles this:
- The pipeline reaches the approval state and sends a task token to an SNS topic
- SNS triggers a notification (email, Slack, PagerDuty)
- The pipeline pauses; the Step Functions execution waits (up to 1 year)
- A reviewer examines the model metrics, evaluation report, and comparison to baseline
- The reviewer calls SendTaskSuccess or SendTaskFailure via an approval UI or CLI
- The pipeline resumes based on the reviewer's decision
I build a simple approval UI as a static page hosted on S3. It displays the evaluation report and gives the reviewer Approve/Reject buttons. Those buttons call API Gateway, which triggers a Lambda that calls SendTaskSuccess or SendTaskFailure. The Lambda validates the reviewer's identity against IAM and logs the decision for audit compliance. The whole thing takes an afternoon to build and saves you from the "someone deployed a model via Slack DM" problem.
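The Lambda behind those buttons reduces to a few lines. The Step Functions client is passed in so it can be stubbed in tests, and reviewer identity is assumed to be validated upstream by the API Gateway authorizer:

```python
import json

def handle_approval(sfn_client, task_token, decision, reviewer):
    """Resume a paused execution based on the reviewer's decision.

    sfn_client is a boto3 Step Functions client (or a test stub)."""
    record = {"reviewer": reviewer, "decision": decision}
    if decision == "approve":
        sfn_client.send_task_success(taskToken=task_token,
                                     output=json.dumps(record))
    else:
        sfn_client.send_task_failure(taskToken=task_token,
                                     error="ModelRejected",
                                     cause=json.dumps(record))
    return record
```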
Cost Optimization
ML training burns money. One ml.p4d.24xlarge instance: $37.69 per hour. A hyperparameter tuning job running 20 training jobs across 10 instances can clear $1,000 in a single run. You have to treat cost optimization as an architectural concern from day one, or the finance team will come knocking.
Spot Instance Savings
| Instance Family | On-Demand $/hr | Typical Spot $/hr | Savings | Interruption Rate |
|---|---|---|---|---|
| ml.m5 (CPU) | $0.23 - $3.69 | $0.07 - $1.11 | ~70% | Low (< 5%) |
| ml.g4dn (T4 GPU) | $0.74 - $7.82 | $0.22 - $2.35 | ~70% | Low (< 5%) |
| ml.g5 (A10G GPU) | $1.52 - $24.32 | $0.46 - $7.30 | ~70% | Moderate (5-15%) |
| ml.p3 (V100 GPU) | $3.83 - $28.15 | $1.15 - $8.45 | ~70% | Moderate (5-15%) |
| ml.p4d (A100 GPU) | $37.69 | $11.31 | ~70% | Higher (10-20%) |
The savings are real. So is the interruption risk, especially on larger GPU instances. The checkpoint and retry strategy from the Training Stage section makes this work. Without it, spot instances are a gamble. With it, they are a legitimate 70% cost reduction.
Managed Warm Pools
SageMaker Managed Warm Pools keep instances provisioned between training jobs. No more 3-5 minute cold starts. This matters most for iterative development and hyperparameter sweeps where you launch job after job after job.
| Scenario | Without Warm Pools | With Warm Pools | Time Saved |
|---|---|---|---|
| 20-job HP sweep (sequential) | 60-100 min startup overhead | ~5 min total | 55-95 min |
| Daily retraining | 3-5 min startup per run | ~0 min | 3-5 min/day |
| Interactive development | 3-5 min per experiment | ~0 min | Significant |
You pay for warm instances while they sit idle, so set KeepAlivePeriodInSeconds to match your actual usage. For hyperparameter sweeps, 3600 seconds (1 hour) works well. For daily retraining? Paying for 23 idle hours to save 4 minutes of startup is a bad trade. Do the math on your own workload.
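"Do the math" is nearly a one-liner if you treat saved startup minutes as billed compute you avoid (a rough model that ignores spot pricing and partial-hour billing):

```python
def warm_pool_worthwhile(jobs_per_day, cold_start_min, hourly_rate,
                         idle_hours_per_day):
    """Rough break-even: value of startup time saved vs. cost of idle
    warm capacity, both priced at the instance's hourly rate."""
    saved_cost = jobs_per_day * (cold_start_min / 60) * hourly_rate
    idle_cost = idle_hours_per_day * hourly_rate
    return saved_cost > idle_cost
```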
Standard vs. Express Workflows
Step Functions has two workflow types. The cost models are completely different:
| Characteristic | Standard Workflow | Express Workflow |
|---|---|---|
| Max duration | 1 year | 5 minutes |
| Pricing | Per state transition ($0.025/1000) | Per request + duration |
| Execution history | 90 days built-in | CloudWatch Logs only |
| Exactly-once execution | Yes | At-least-once |
| Suitable for training pipelines? | Yes: primary pipeline | No: too short for training |
| Suitable for data validation? | Overkill | Yes: fast, cheap |
I use a hybrid approach. The main pipeline runs as a Standard workflow because training jobs are long-running and I need exactly-once semantics. The lightweight data validation sub-workflow runs as an Express workflow nested inside. Fast, cheap, and I do not care if it runs at-least-once because the validation is idempotent anyway. Best of both worlds.
Monitoring and Observability
You need visibility into every stage of the pipeline. When a model starts misbehaving in production, the first question is always "what changed?" Was it the data? The training? The infrastructure? Without observability wired in from the start, you are guessing.
CloudWatch Integration
SageMaker training jobs emit metrics to CloudWatch automatically. Here is what I monitor and alert on:
| Metric | Source | Alert Threshold |
|---|---|---|
| train:loss | Training algorithm | Loss increasing for > 3 epochs |
| validation:accuracy | Training algorithm | Below baseline by > 5% |
| GPUUtilization | SageMaker infrastructure | Below 50% (over-provisioned) or above 95% (throttled) |
| MemoryUtilization | SageMaker infrastructure | Above 90% (OOM risk) |
| DiskUtilization | SageMaker infrastructure | Above 80% (checkpoint storage) |
Step Functions Execution History
Step Functions keeps execution history for 90 days on Standard workflows. For each execution, you get:
- Input/output for every state: What parameters were passed, what results returned
- Duration per state: Where the pipeline spends its time
- Error details: Full error messages and stack traces for failed states
- Retry attempts: How many retries occurred and why
My CloudWatch dashboards track:
- Pipeline success rate: Percentage of executions that complete successfully
- Stage duration trends: Whether training jobs are getting slower over time
- Cost per execution: Total compute cost tracked via tags
- Failure categorization: Which failure categories are most common
Custom Metrics Pipeline
The built-in metrics only get you so far. I push custom metrics from each pipeline stage:
| Stage | Custom Metrics |
|---|---|
| Data preparation | Records processed, feature count, null ratio |
| Training | Final loss, best metric, total epochs, early stopping epoch |
| Evaluation | All comparison metrics, pass/fail decision |
| Deployment | Endpoint creation time, smoke test latency |
All of this feeds a single CloudWatch dashboard. One screen, full picture of pipeline health. Alarms trigger SNS notifications straight to the ML engineering team's Slack channel. When something breaks at 2 AM, the on-call engineer can see exactly which stage failed and why before they even open the console.
Production Patterns
Everything above comes together in a production pipeline that handles the full lifecycle. Trigger, train, evaluate, deploy, monitor, retrain. Over and over.
Blue/Green Model Deployment
Deploying a new model to a SageMaker endpoint is slow. You cannot just swap models and hope for the best. The blue/green pattern gives you zero-downtime deployment:
- Create new endpoint configuration with the new model
- Update the endpoint: SageMaker provisions new instances behind the load balancer
- Run smoke tests against the new instances
- Shift traffic: SageMaker routes traffic to new instances and drains old ones
- Monitor for regression: Watch error rates and latency for 15-30 minutes
- Complete or rollback: If metrics are stable, decommission old instances; if metrics degrade, shift traffic back
In Step Functions, this maps to a series of states with Wait states for stabilization periods and Choice states for go/no-go decisions. Clean and auditable.
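The go/no-go check at step 6 evaluates something like this after the stabilization Wait (thresholds are illustrative):

```python
def deployment_decision(old_metrics, new_metrics,
                        max_error_rate_increase=0.005,
                        max_latency_increase_ms=25):
    """Decide whether to complete the blue/green switch or roll back,
    based on regression in error rate or tail latency."""
    if new_metrics["error_rate"] - old_metrics["error_rate"] > max_error_rate_increase:
        return "rollback"
    if new_metrics["p99_ms"] - old_metrics["p99_ms"] > max_latency_increase_ms:
        return "rollback"
    return "complete"
```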
A/B Testing Integration
Sometimes you need to measure business impact before committing to a full deployment. SageMaker endpoint production variants let you run A/B tests:
| Configuration | Use Case | Traffic Split |
|---|---|---|
| Shadow mode | Compare predictions without affecting users | 100% to production, 100% mirrored to candidate |
| Canary deployment | Gradual rollout with metrics monitoring | 95/5 → 90/10 → 50/50 → 100/0 |
| Multi-variant | Test multiple model versions simultaneously | Equal split across variants |
The Step Functions workflow creates the multi-variant endpoint configuration and then waits. Literally. Wait states collect sufficient data before the pipeline makes a promotion decision. Patience built into the architecture.
Automated Retraining Triggers
Models degrade. Data distributions shift. User behavior changes. The pipeline needs to run continuously as part of a feedback loop. Automated retraining triggers close that loop:
| Trigger | Source | Pipeline Action |
|---|---|---|
| Schedule | EventBridge rule (daily/weekly) | Full pipeline execution |
| Data drift | Model Monitor | Pipeline execution with drift flag |
| Performance degradation | CloudWatch alarm | Pipeline execution with urgency flag |
| New data volume | S3 event notification | Pipeline execution after threshold |
| Manual | Console/API/CI-CD | Pipeline execution with custom parameters |
EventBridge rules trigger the Step Functions state machine. The input event includes the trigger type, and the pipeline adjusts its behavior accordingly. A drift-triggered retraining uses a larger dataset window than a scheduled retraining, for instance. The trigger context drives the pipeline configuration without requiring separate pipeline definitions for each scenario.
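The trigger-to-configuration mapping can be a small function at the front of the pipeline (window sizes and flag names are illustrative defaults, not prescriptions):

```python
def pipeline_config(trigger_event):
    """Translate an EventBridge trigger event into pipeline parameters."""
    trigger = trigger_event.get("trigger_type", "schedule")
    config = {"trigger": trigger, "data_window_days": 30, "urgent": False}
    if trigger == "drift":
        config["data_window_days"] = 90   # retrain on a wider data window
    elif trigger == "degradation":
        config["urgent"] = True           # escalate failures, skip optional waits
    elif trigger == "manual":
        config.update(trigger_event.get("parameters", {}))
    return config
```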
The Complete Production Pipeline
Here is the complete end-to-end production pipeline:
```mermaid
flowchart LR
    A[EventBridge Trigger] --> B[Data Preparation]
    B --> C[Feature Engineering]
    C --> D[Training Map State]
    D --> E[Model Selection]
    E --> F[Evaluation vs Baseline]
    F --> G{Auto-Promote or Review?}
    G -->|Auto| H[Register Model]
    G -->|Review| I[Human Approval]
    I -->|Approved| H
    I -->|Rejected| J[Archive & Alert]
    H --> K[Blue/Green Deploy]
    K --> L[Smoke Tests]
    L --> M{Tests Pass?}
    M -->|Yes| N[Shift Traffic]
    M -->|No| O[Rollback Endpoint]
    N --> P[Model Monitor]
    P --> Q{Drift or Degradation?}
    Q -->|Yes| A
    Q -->|No| P
```

Standard Step Functions workflow at the top level, nested Express workflows for data validation substeps. End to end, a medium-sized training job takes 2-4 hours from data ingestion to production traffic. Training dominates that time. Everything else finishes in minutes.
Key Production Considerations
Before you deploy any of this, sort out these operational requirements. I have seen teams skip these and regret it within a month:
| Requirement | Implementation |
|---|---|
| IAM least privilege | Separate execution roles for pipeline, training, and deployment |
| VPC networking | Training jobs in private subnets with VPC endpoints for S3, ECR, CloudWatch (see Best Practices for Networking in AWS SageMaker) |
| Artifact encryption | KMS customer-managed keys for S3 model artifacts and EBS volumes |
| Audit logging | CloudTrail for API calls, Step Functions history for pipeline decisions |
| Cost tagging | Consistent tags across all resources: project, pipeline, execution-id |
| Quotas | Request limit increases for concurrent training jobs, endpoints, and processing jobs |
Additional Resources
- AWS Step Functions Developer Guide: Service Integrations with SageMaker
- SageMaker Training: Managed Spot Training
- SageMaker Training: Managed Warm Pools
- SageMaker Training: Distributed Training
- SageMaker Model Registry
- SageMaker Model Monitor
- Step Functions: Error Handling
- Step Functions: Map State
- Step Functions: Callback Pattern
Let's Build Something!
I help teams ship cloud infrastructure that actually works at scale. Whether you're modernizing a legacy platform, designing a multi-region architecture from scratch, or figuring out how AI fits into your engineering workflow, I've seen your problem before. Let me help.
Currently taking on select consulting engagements through Vantalect.

