About the author: I'm Charles Sieg, a cloud architect and platform engineer who builds apps, services, and infrastructure for Fortune 1000 clients through Vantalect. If your organization is rethinking its software strategy in the age of AI-assisted engineering, let's talk.
Every platform that accepts user-uploaded video faces the same operational reality: a single piece of unmoderated content can produce legal liability, advertiser flight, and reputational damage that takes months to repair. I have built content moderation systems for platforms processing thousands of hours of video per day, and the architectural pattern I keep returning to is a Step Functions orchestration layer coordinating AWS managed AI services. Rekognition scans frames for nudity, violence, hate symbols, and other policy violations; it also identifies celebrities and labels objects and scenes. Transcribe pulls the audio track into a timestamped transcript. Step Functions ties these asynchronous, variable-duration jobs into a single deterministic pipeline that writes a structured metadata package back to S3 alongside the original video. This article is the architecture reference for that pipeline: the service integrations, the ASL definitions, the failure modes, the cost model, and the operational lessons that only surface under production load.
The Problem: Video at Scale
Platform-Scale Moderation Requirements
A platform accepting user-uploaded video needs to answer several questions about every piece of content before it goes live. Does the video contain nudity, violence, or other policy-violating material? Does it feature recognizable public figures whose likeness creates licensing or consent obligations? What objects and scenes appear in the video, and how should those feed into content classification and recommendation systems? What is being said in the audio track, and does the spoken content violate community guidelines?
Manual review cannot keep pace with upload volume. A human moderator watching video in real time processes one minute of content per minute of labor; at a $20/hour loaded cost, that is roughly $0.33 per minute of video. A platform receiving 10,000 hours of uploads per day would need roughly 1,800 full-time moderators just for first-pass screening, at an annual labor cost exceeding $70 million. Automated analysis running in parallel across multiple ML dimensions removes that staffing bottleneck: it scales elastically with upload volume and reserves human review for the ambiguous cases that actually require judgment.
Why Orchestration Matters
Each of these ML services (Rekognition Content Moderation, Rekognition Celebrity Recognition, Rekognition Label Detection, Amazon Transcribe) operates asynchronously. You submit a job, receive a job identifier, and poll or wait for completion. Job durations vary based on video length, resolution, and service load. A naive implementation scatters this coordination logic across Lambda functions, SQS queues, and DynamoDB status tables. Within six months, nobody can reason about the end-to-end flow, retry logic is inconsistent, and partial failures leave orphaned jobs consuming resources.
Step Functions eliminates this class of problems. The entire pipeline (trigger, fan-out, parallel execution, polling, aggregation, output) lives in a single ASL definition. Every execution is visible in the console. Retry policies are declarative. Error handling is consistent across all branches. For a deeper treatment of Step Functions internals, see AWS Step Functions: An Architecture Deep-Dive.
Pipeline Architecture Overview
Component Inventory
The pipeline uses six AWS services, each performing a specific role:
| Service | Role | Integration Pattern |
|---|---|---|
| Amazon S3 | Video storage; trigger source; metadata output destination | EventBridge notification on object creation |
| Amazon EventBridge | Event routing from S3 to Step Functions | Rule matching on S3 object-created events |
| AWS Step Functions | Workflow orchestration; parallel fan-out; polling; aggregation | Standard workflow (exactly-once, durable) |
| Amazon Rekognition Video | Content moderation, celebrity recognition, label detection | SDK integration (Request Response) with polling loop |
| Amazon Transcribe | Audio-to-text transcription | SDK integration (Request Response) with polling loop |
| AWS Lambda | Result aggregation and metadata assembly | Optimized integration from Step Functions |
No custom ML models. No container infrastructure. No GPU instances. Every ML capability is a managed API call billed per minute of video analyzed. The only compute you manage is a single Lambda function that assembles the final metadata package.
Data Flow
The pipeline follows a linear trigger-to-output path with a parallel fan-out in the middle.
```mermaid
graph LR
    A[S3 Bucket<br/>Video Upload] -->|Object Created| B[EventBridge<br/>Rule]
    B -->|Start Execution| C[Step Functions<br/>Workflow]
    C --> D[Parallel<br/>Analysis]
    D --> E1[Rekognition<br/>Content Moderation]
    D --> E2[Rekognition<br/>Celebrity Recognition]
    D --> E3[Rekognition<br/>Label Detection]
    D --> E4[Amazon<br/>Transcribe]
    E1 --> F[Lambda<br/>Aggregate Results]
    E2 --> F
    E3 --> F
    E4 --> F
    F --> G[S3 Bucket<br/>Metadata JSON]
```

A video lands in S3. EventBridge picks up the object-created notification and starts a Step Functions execution, passing the bucket name and object key as input. The workflow fans out into four parallel branches. Each branch starts an async analysis job, polls for completion, and retrieves the results. When all four branches complete, a Lambda function aggregates the results into a unified metadata JSON document and writes it to S3 alongside the original video.
Triggering the Pipeline
S3 Event Notifications to EventBridge
S3 can deliver event notifications through two mechanisms: classic event notifications targeting SNS, SQS, or Lambda directly, and S3 Event Notifications to EventBridge. For this pipeline, EventBridge is the correct choice. It supports content-based filtering (match on file extension, prefix, size), routes to Step Functions as a first-class target, and provides dead-letter queue support for failed deliveries.
Enable EventBridge notifications on the S3 bucket through the bucket properties. This is a one-time configuration that tells S3 to publish all events to EventBridge in addition to any legacy notification configuration. For a thorough comparison of AWS event routing services, see AWS Event-Driven Messaging: SNS, SQS, EventBridge, and Beyond.
EventBridge Rule Configuration
The EventBridge rule filters for video uploads and targets the Step Functions state machine:
{
"source": ["aws.s3"],
"detail-type": ["Object Created"],
"detail": {
"bucket": {
"name": ["my-video-bucket"]
},
"object": {
"key": [{
"suffix": ".mp4"
}]
}
}
}
Add additional suffix patterns for other video formats (.mov, .avi, .mkv). The rule target is the Step Functions state machine ARN with an IAM role that grants states:StartExecution. EventBridge passes the full event detail as the execution input.
Input Transformation
The raw EventBridge event contains metadata you do not need in the workflow. Use an input transformer on the EventBridge target to extract only the bucket and key:
{
"InputPathsMap": {
"bucket": "$.detail.bucket.name",
"key": "$.detail.object.key",
"size": "$.detail.object.size"
},
"InputTemplate": "{\"bucket\": \"<bucket>\", \"key\": \"<key>\", \"size\": <size>}"
}
The Step Functions execution now receives a clean input object with three fields. Every downstream state can reference $.bucket and $.key without navigating nested event structures.
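The transformer's effect is easy to verify locally. The sketch below reproduces the same extraction as plain Python against a hand-built sample event (the event shape follows the documented S3 "Object Created" detail structure; the bucket and key values are illustrative):

```python
# Mimics the EventBridge input transformer: reduce a raw S3 "Object Created"
# event to the three fields the workflow needs.

def transform_event(event: dict) -> dict:
    detail = event["detail"]
    return {
        "bucket": detail["bucket"]["name"],
        "key": detail["object"]["key"],
        "size": detail["object"]["size"],
    }

sample = {
    "source": "aws.s3",
    "detail-type": "Object Created",
    "detail": {
        "bucket": {"name": "my-video-bucket"},
        "object": {"key": "uploads/video.mp4", "size": 1048576},
    },
}
print(transform_event(sample))
```

In the real pipeline this mapping runs inside EventBridge; the function exists only to make the transformation concrete.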
The Step Functions Workflow
Workflow Design
This pipeline requires a Standard workflow. Express workflows do not support the .waitForTaskToken pattern or execution durations longer than five minutes, and video analysis jobs routinely run for several minutes on longer content. Standard workflows provide exactly-once execution semantics, durable state checkpointing, and 90-day execution history retention. The cost difference is negligible at the scale of a per-video invocation (typically $0.025 per thousand state transitions).
For a detailed comparison of Standard vs. Express workflows and guidance on choosing between them, see AWS Step Functions: An Architecture Deep-Dive.
The top-level workflow structure is a Parallel state containing four branches, followed by a Task state that invokes the aggregation Lambda:
{
"Comment": "Video content moderation pipeline",
"StartAt": "AnalyzeVideo",
"States": {
"AnalyzeVideo": {
"Type": "Parallel",
"Branches": [
{ "StartAt": "StartContentModeration", "States": { "..." : {} } },
{ "StartAt": "StartCelebrityRecognition", "States": { "..." : {} } },
{ "StartAt": "StartLabelDetection", "States": { "..." : {} } },
{ "StartAt": "StartTranscription", "States": { "..." : {} } }
],
"ResultPath": "$.analysisResults",
"Next": "AggregateResults"
},
"AggregateResults": {
"Type": "Task",
"Resource": "arn:aws:lambda:us-east-1:123456789012:function:AggregateVideoMetadata",
"Next": "Done"
},
"Done": {
"Type": "Succeed"
}
}
}
The Parallel state runs all four branches concurrently. Each branch is an independent sub-state-machine with its own Start, Get, and Wait states. The Parallel state collects the output of all branches into an array at $.analysisResults and passes it to the aggregation function.
Parallel Analysis Branches
Each Rekognition branch follows the same three-step pattern:
- Start the async job (Request Response integration, returns immediately with a JobId)
- Wait a fixed interval (10-30 seconds, depending on expected video length)
- Get the job status; if still IN_PROGRESS, loop back to Wait; if SUCCEEDED, pass results forward
This polling pattern is necessary because Rekognition Video does not have an optimized .sync integration with Step Functions. The SDK integration is Request Response only. You call StartContentModeration, get back a JobId, and then poll GetContentModeration until the JobStatus field returns SUCCEEDED or FAILED.
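The Wait → Get → Check loop can be sketched as ordinary code. The version below injects the status call as a callable so the loop can be exercised without AWS credentials; in the pipeline itself this logic lives in ASL states, not in a Lambda:

```python
import time

def poll_until_done(get_status, interval_seconds=20, max_attempts=90, sleep=time.sleep):
    """Call get_status() until it returns SUCCEEDED or FAILED, or the attempt budget runs out."""
    for _ in range(max_attempts):
        status = get_status()
        if status in ("SUCCEEDED", "FAILED"):
            return status
        sleep(interval_seconds)  # mirrors the Wait state
    raise TimeoutError("job did not finish within the polling budget")

# Simulated job that reports IN_PROGRESS twice before succeeding.
states = iter(["IN_PROGRESS", "IN_PROGRESS", "SUCCEEDED"])
result = poll_until_done(lambda: next(states), sleep=lambda _: None)
print(result)  # SUCCEEDED
```

The max_attempts bound matters: without it, a job stuck IN_PROGRESS would be polled forever.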
Handling Async Rekognition Jobs
Here is the complete ASL for the Content Moderation branch, which illustrates the polling pattern all three Rekognition branches share:
{
"StartContentModeration": {
"Type": "Task",
"Resource": "arn:aws:states:::aws-sdk:rekognition:startContentModeration",
"Parameters": {
"Video": {
"S3Object": {
"Bucket.$": "$.bucket",
"Name.$": "$.key"
}
},
"MinConfidence": 60,
"NotificationChannel": {
"SNSTopicArn": "arn:aws:sns:us-east-1:123456789012:RekognitionTopic",
"RoleArn": "arn:aws:iam::123456789012:role/RekognitionSNSRole"
}
},
"ResultPath": "$.moderationJob",
"Next": "WaitForModeration"
},
"WaitForModeration": {
"Type": "Wait",
"Seconds": 20,
"Next": "GetContentModeration"
},
"GetContentModeration": {
"Type": "Task",
"Resource": "arn:aws:states:::aws-sdk:rekognition:getContentModeration",
"Parameters": {
"JobId.$": "$.moderationJob.JobId",
"MaxResults": 1000
},
"ResultPath": "$.moderationResults",
"Next": "CheckModerationStatus"
},
"CheckModerationStatus": {
"Type": "Choice",
"Choices": [
{
"Variable": "$.moderationResults.JobStatus",
"StringEquals": "SUCCEEDED",
"Next": "ModerationComplete"
},
{
"Variable": "$.moderationResults.JobStatus",
"StringEquals": "FAILED",
"Next": "ModerationFailed"
}
],
"Default": "WaitForModeration"
},
"ModerationComplete": {
"Type": "Pass",
"End": true
},
"ModerationFailed": {
"Type": "Fail",
"Error": "ModerationJobFailed",
"Cause": "Rekognition content moderation job failed"
}
}
The MinConfidence parameter of 60 returns moderation labels with at least 60% confidence. Setting this lower increases recall (catches more potentially problematic content) at the cost of more false positives. Setting it higher reduces noise but risks missing borderline content. I default to 60 for first-pass automated screening and route anything flagged to human review.
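A minimal sketch of that screening decision, using the ModerationLabels entry shape that GetContentModeration returns (the label names, timestamps, and confidences here are made up):

```python
# First-pass screening: anything at or above the threshold goes to human review.

def screen_labels(results, min_confidence=60.0):
    """results: entries of the form {'Timestamp': ms, 'ModerationLabel': {'Name', 'ParentName', 'Confidence'}}."""
    flagged = [r for r in results if r["ModerationLabel"]["Confidence"] >= min_confidence]
    return {"flagged": flagged, "needs_human_review": bool(flagged)}

results = [
    {"Timestamp": 45200, "ModerationLabel": {"Name": "Violence", "ParentName": "", "Confidence": 87.3}},
    {"Timestamp": 12000, "ModerationLabel": {"Name": "Alcohol", "ParentName": "", "Confidence": 52.1}},
]
out = screen_labels(results)
print(out["needs_human_review"], len(out["flagged"]))  # True 1
```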
Rekognition Analysis Branches
Content Moderation
Rekognition Content Moderation analyzes video frames for inappropriate or offensive visual content. The service uses a three-level hierarchical taxonomy:
| Level 1 Category | Example Level 2 Labels | Detection Scope |
|---|---|---|
| Explicit Nudity | Nudity, Graphic Male Nudity, Graphic Female Nudity, Sexual Activity | Full and partial nudity, sexual content |
| Non-Explicit Nudity | Revealing Clothes, Male Swimwear, Female Swimwear | Suggestive but not explicit content |
| Violence | Graphic Violence, Physical Violence, Weapon Violence, Self Injury | Combat, weapons, blood, injury |
| Visually Disturbing | Emaciated Bodies, Corpses, Hanging, Air Crash | Graphic injury, death, disaster |
| Drugs & Tobacco | Drug Use, Drug Paraphernalia, Tobacco Products | Substance use and paraphernalia |
| Alcohol | Drinking, Alcoholic Beverages, Beer, Wine | Alcohol consumption and products |
| Gambling | Gambling, Casino | Gaming and betting |
| Hate Symbols | Nazi Party, White Supremacy, Extremist | Symbols associated with hate groups |
| Rude Gestures | Middle Finger | Offensive hand gestures |
The response includes a ModerationLabels array where each entry contains the label name, its parent label (hierarchy position), the confidence score, and the timestamp in milliseconds where the content was detected. This timestamp precision lets you build frame-accurate moderation: flag specific segments of a video rather than rejecting the entire file.
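Frame-accurate flagging reduces, in practice, to merging nearby detection timestamps into review segments. A sketch, with the two-second gap tolerance as an arbitrary assumption:

```python
# Collapse per-frame moderation hits into contiguous flagged (start, end) ranges
# so reviewers see segments instead of the whole video.

def flag_segments(timestamps_ms, max_gap_ms=2000):
    segments = []
    for ts in sorted(timestamps_ms):
        if segments and ts - segments[-1][1] <= max_gap_ms:
            segments[-1][1] = ts          # extend the open segment
        else:
            segments.append([ts, ts])     # start a new segment
    return [tuple(s) for s in segments]

print(flag_segments([45200, 45700, 46100, 90000]))  # [(45200, 46100), (90000, 90000)]
```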
Celebrity Recognition
Rekognition Celebrity Recognition identifies tens of thousands of public figures across entertainment, sports, politics, business, and media. The StartCelebrityRecognition / GetCelebrityRecognition API pair follows the same async pattern as Content Moderation.
The response includes, for each detected celebrity: name, unique Rekognition ID, confidence score, bounding box coordinates, face landmarks, known gender, and an array of URLs pointing to external reference information (typically IMDb or Wikipedia). The timestamp field indicates when the celebrity appears in the video.
The ASL for this branch is structurally identical to the Content Moderation branch. Replace startContentModeration with startCelebrityRecognition and getContentModeration with getCelebrityRecognition in the resource ARNs. The parameter block for the Start call is simpler since Celebrity Recognition takes no confidence threshold or additional configuration beyond the video source.
Celebrity detection serves two purposes in a moderation pipeline. First, it identifies individuals whose appearance in user-generated content may have legal implications (right of publicity, defamation risk). Second, it feeds content classification: a video featuring a specific athlete is likely sports content; a video featuring a politician is likely news or political commentary. This metadata improves downstream categorization and recommendation quality.
Label Detection
Rekognition Label Detection identifies objects, scenes, activities, and concepts in video frames. The service returns labels such as "Car," "Beach," "Running," "Crowd," "Dog," and thousands of others, each with a confidence score and a bounding box (for objects) or timestamp (for activities and scenes).
The StartLabelDetection API accepts optional parameters:
| Parameter | Default | Description |
|---|---|---|
| MinConfidence | 50 | Minimum confidence threshold for returned labels |
| Features | None | Set to GENERAL_LABELS for standard detection |
| Settings.GeneralLabels.LabelInclusionFilters | All | Array of specific labels to detect (ignores all others) |
| Settings.GeneralLabels.LabelExclusionFilters | None | Array of labels to exclude from results |
Label Detection is the broadest of the three Rekognition APIs. Content Moderation and Celebrity Recognition have narrow outputs (flags and names). Label Detection produces a dense metadata stream: every object in every analyzed frame generates a result entry. For a 10-minute video, expect thousands of label entries. The MaxResults parameter on the Get call controls pagination (maximum 1000 per page), and you should implement NextToken-based pagination in the polling step to capture the full result set.
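The pagination loop is simple but easy to forget. Sketched below with the Get call injected as a callable (the two-page response is simulated):

```python
# Accumulate all pages of a paginated Get call by following NextToken.

def collect_all_labels(get_page):
    """get_page(next_token) -> dict with 'Labels' and an optional 'NextToken'."""
    labels, token = [], None
    while True:
        page = get_page(token)
        labels.extend(page.get("Labels", []))
        token = page.get("NextToken")
        if not token:
            return labels

pages = {None: {"Labels": ["Car", "Road"], "NextToken": "p2"},
         "p2": {"Labels": ["Dog"]}}
print(collect_all_labels(lambda t: pages[t]))  # ['Car', 'Road', 'Dog']
```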
Bringing Results Together
When all three Rekognition branches and the Transcribe branch complete, the Parallel state collects their outputs into a four-element array. The array ordering matches the branch ordering in the ASL definition. Branch 0 contains content moderation results, Branch 1 contains celebrity results, Branch 2 contains label results, and Branch 3 contains the transcription output.
The aggregation Lambda receives this array, restructures it into a unified metadata document, and writes the document to S3:
{
"videoKey": "uploads/2026/02/25/user-video-abc123.mp4",
"analyzedAt": "2026-02-25T14:30:00Z",
"moderation": {
"flagged": true,
"labels": [
{
"label": "Violence",
"confidence": 87.3,
"timestampMs": 45200
}
]
},
"celebrities": [
{
"name": "John Smith",
"confidence": 99.1,
"timestampMs": 12000,
"urls": ["https://www.imdb.com/name/nm0000001/"]
}
],
"labels": {
"summary": ["Car", "Road", "Person", "Building"],
"detailed": [ "..." ]
},
"transcript": {
"status": "COMPLETED",
"outputUri": "s3://my-video-bucket/transcripts/user-video-abc123.json"
},
"pipelineExecutionArn": "arn:aws:states:us-east-1:123456789012:execution:VideoModerationPipeline:abc-123"
}
The metadata file is written to a predictable path derived from the original video key: replace the file extension with .metadata.json, or write to a parallel prefix (e.g., metadata/ instead of uploads/). Downstream consumers (moderation dashboards, recommendation engines, search indexers) read the metadata file without needing to invoke any ML services themselves.
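The extension-replacement convention is a one-liner worth pinning down, since ad hoc string slicing tends to mangle keys whose directory names contain dots. A sketch:

```python
import posixpath

def metadata_key(video_key: str) -> str:
    """uploads/.../video.mp4 -> uploads/.../video.metadata.json"""
    root, _ext = posixpath.splitext(video_key)  # splits only the final extension
    return root + ".metadata.json"

print(metadata_key("uploads/2026/02/25/user-video-abc123.mp4"))
```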
Transcription with Amazon Transcribe
Starting a Transcription Job
Amazon Transcribe extracts spoken audio from video files and produces a timestamped transcript. The StartTranscriptionJob API accepts video files directly from S3; no pre-processing or audio extraction step is required. Transcribe handles the audio stream extraction internally.
The Step Functions SDK integration for Transcribe follows the same Request Response pattern as Rekognition:
{
"StartTranscription": {
"Type": "Task",
"Resource": "arn:aws:states:::aws-sdk:transcribe:startTranscriptionJob",
"Parameters": {
"TranscriptionJobName.$": "States.Format('moderation-{}', $$.Execution.Name)",
"LanguageCode": "en-US",
"Media": {
"MediaFileUri.$": "States.Format('s3://{}/{}', $.bucket, $.key)"
},
"OutputBucketName.$": "$.bucket",
"OutputKey.$": "States.Format('transcripts/{}.json', $.key)",
"Settings": {
"ShowSpeakerLabels": true,
"MaxSpeakerLabels": 10
}
},
"ResultPath": "$.transcriptionJob",
"Next": "WaitForTranscription"
}
}
The States.Format intrinsic function constructs the S3 URI and output key dynamically from the execution input. The job name is derived from the execution name ($$.Execution.Name in the context object) rather than the object key: transcription job names must be unique and may contain only letters, numbers, periods, underscores, and hyphens, so an S3 key containing slashes would be rejected. ShowSpeakerLabels enables speaker diarization, which identifies and labels different speakers in the audio. This is valuable for moderation because it lets you attribute specific statements to specific speakers in a conversation.
Polling for Completion
Transcribe jobs, like Rekognition jobs, run asynchronously. The polling pattern is the same: Wait, Get status, Check, loop or proceed.
{
"WaitForTranscription": {
"Type": "Wait",
"Seconds": 30,
"Next": "GetTranscriptionStatus"
},
"GetTranscriptionStatus": {
"Type": "Task",
"Resource": "arn:aws:states:::aws-sdk:transcribe:getTranscriptionJob",
"Parameters": {
"TranscriptionJobName.$": "$.transcriptionJob.TranscriptionJob.TranscriptionJobName"
},
"ResultPath": "$.transcriptionStatus",
"Next": "CheckTranscriptionStatus"
},
"CheckTranscriptionStatus": {
"Type": "Choice",
"Choices": [
{
"Variable": "$.transcriptionStatus.TranscriptionJob.TranscriptionJobStatus",
"StringEquals": "COMPLETED",
"Next": "TranscriptionComplete"
},
{
"Variable": "$.transcriptionStatus.TranscriptionJob.TranscriptionJobStatus",
"StringEquals": "FAILED",
"Next": "TranscriptionFailed"
}
],
"Default": "WaitForTranscription"
}
}
Transcription jobs typically complete in 20-50% of the video's duration. A 10-minute video usually finishes transcription within 2-5 minutes. The 30-second polling interval balances responsiveness against API call volume. For very long videos (over one hour), increase the polling interval to 60 seconds to reduce unnecessary GetTranscriptionJob calls.
Retrieving and Storing the Transcript
Unlike Rekognition, Transcribe writes its output directly to S3 at the location specified in OutputBucketName and OutputKey. The transcription branch does not need to retrieve and pass results through the Step Functions execution. It confirms completion and passes the output location forward. The aggregation Lambda reads the transcript from S3 if it needs to extract specific content for the metadata package, or it simply records the transcript location in the metadata document.
Transcribe output includes word-level timestamps, confidence scores per word, punctuation, and speaker labels. This granularity enables transcript-based moderation: flag specific time ranges where profanity, hate speech, or other policy-violating language appears. Combine this with Rekognition Content Moderation timestamps for a complete picture of which segments of the video require human review.
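Transcript-based flagging then reduces to a scan over word-level items. The sketch below uses a simplified version of Transcribe's item shape; the blocklist is a placeholder for a real policy lexicon:

```python
# Flag time ranges where blocklisted words appear in the transcript.
# Item shape follows Transcribe's word-level output (simplified).

def flag_transcript(items, blocklist):
    hits = []
    for it in items:
        if it.get("type") != "pronunciation":
            continue  # punctuation items carry no timestamps
        word = it["alternatives"][0]["content"].lower()
        if word in blocklist:
            hits.append((float(it["start_time"]), float(it["end_time"])))
    return hits

items = [
    {"type": "pronunciation", "start_time": "0.0", "end_time": "0.4",
     "alternatives": [{"content": "hello"}]},
    {"type": "pronunciation", "start_time": "0.4", "end_time": "0.9",
     "alternatives": [{"content": "Badword"}]},
    {"type": "punctuation", "alternatives": [{"content": "."}]},
]
print(flag_transcript(items, {"badword"}))  # [(0.4, 0.9)]
```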
Pricing and Cost Modeling
Per-Service Pricing
Every service in this pipeline bills per unit of content processed. No provisioned capacity, no hourly instance costs, no idle charges.
| Service | Unit | Price | Free Tier |
|---|---|---|---|
| Rekognition Video (all APIs) | Per minute of video | $0.10/min per API | 60 min/month for 12 months |
| Amazon Transcribe (batch) | Per second of audio (15s minimum) | $0.024/min ($0.0004/sec) | 60 min/month for 12 months |
| Step Functions (Standard) | Per state transition | $0.000025/transition | 4,000 transitions/month |
| Lambda | Per request + duration | $0.20/1M requests + $0.0000166667/GB-sec | 1M requests + 400K GB-sec/month |
| EventBridge | Per event published | Free for AWS service events (including S3); $1.00/1M custom events | N/A |
| S3 (storage + requests) | Per GB stored + per request | $0.023/GB + $0.005/1K PUT | 5 GB for 12 months |
The dominant cost is Rekognition Video at $0.10 per minute per API. Running three Rekognition APIs (Content Moderation, Celebrity Recognition, Label Detection) against the same video means $0.30 per minute of video. Transcribe adds $0.024 per minute. Step Functions, Lambda, EventBridge, and S3 request costs are negligible at typical volumes.
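For capacity planning, the per-video arithmetic is worth encoding as a function. The rates below are assumptions pinned to this article's pricing table, and the transition count is an estimate; adjust both when AWS pricing or the workflow changes:

```python
# Rough per-video cost model; rates assumed from the pricing table above.
REKOGNITION_PER_MIN_PER_API = 0.10
TRANSCRIBE_PER_MIN = 0.024
SFN_PER_TRANSITION = 0.000025

def cost_per_video(minutes, rekognition_apis=3, transitions=40):
    rek = minutes * REKOGNITION_PER_MIN_PER_API * rekognition_apis
    tx = minutes * TRANSCRIBE_PER_MIN
    sfn = transitions * SFN_PER_TRANSITION
    return round(rek + tx + sfn, 3)

print(cost_per_video(10))  # ~3.24 for a ten-minute video
```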
Cost Model for a Typical Pipeline Run
| Video Duration | Rekognition (3 APIs) | Transcribe | Step Functions (~40 transitions) | Total per Video |
|---|---|---|---|---|
| 1 minute | $0.30 | $0.024 | $0.001 | $0.33 |
| 5 minutes | $1.50 | $0.12 | $0.001 | $1.62 |
| 10 minutes | $3.00 | $0.24 | $0.001 | $3.24 |
| 30 minutes | $9.00 | $0.72 | $0.001 | $9.72 |
| 60 minutes | $18.00 | $1.44 | $0.001 | $19.44 |
At scale, cost accumulates quickly. Processing 1,000 ten-minute videos per day costs approximately $3,240 per day ($97,200 per month) in Rekognition and Transcribe charges alone. For strategies on managing S3 storage costs as metadata and transcripts accumulate, see AWS S3 Cost Optimization: The Complete Savings Playbook.
Optimization Strategies
Several techniques reduce per-video cost without sacrificing coverage:
Selective API invocation. Not every video needs all four analysis types. Add a classification step at the beginning of the workflow that checks video metadata (uploader trust level, content category, file size) and routes to a reduced set of APIs. Trusted uploaders with established track records might skip Celebrity Recognition. Short clips under 15 seconds might skip Transcribe.
Frame sampling for Rekognition. Rekognition Video analyzes frames at a sampling rate it determines internally. For Label Detection, where you need broad categorization rather than frame-accurate tracking, consider extracting keyframes with a lightweight Lambda + FFmpeg step and running Rekognition Image APIs ($0.001 per image for Label Detection) instead of the Video API. For a 10-minute video with one keyframe per second, that is 600 images at $0.60 total vs. $1.00 for the Video API. This approach reduces cost by 40% for Label Detection specifically.
Confidence threshold tuning. Higher MinConfidence values reduce the volume of results stored and processed downstream. For Content Moderation, a threshold of 80 dramatically reduces false positives at the cost of a small increase in missed detections. Tune this based on your platform's risk tolerance and human review capacity.
Transcribe language detection. If your platform handles multilingual content, enable automatic language detection in Transcribe rather than running separate jobs for each possible language. This avoids wasted Transcribe minutes on incorrect language guesses.
Operational Considerations
Failure Modes
Production pipelines fail. The question is whether failures are observable, recoverable, and contained. Here are the failure modes I have encountered in production:
| Failure Mode | Symptom | Mitigation |
|---|---|---|
| Rekognition throttling | ProvisionedThroughputExceededException on Start calls | Exponential backoff retry in ASL (3 retries, 2x backoff, 1s initial interval) |
| Video format unsupported | Rekognition returns InvalidParameterException | Validate format (H.264 in MP4/MOV) before starting analysis; add a format-check state |
| Video exceeds 10 GB | Start call rejected | Check $.size from EventBridge input and route oversized files to a separate processing path or rejection state |
| Transcribe job name collision | ConflictException on StartTranscriptionJob | Append a unique suffix (execution ID or UUID) to the job name |
| S3 write failure on metadata | Lambda error on PutObject | Retry the aggregation Lambda with a Catch block; ensure IAM permissions include s3:PutObject |
| Parallel branch timeout | One branch runs far longer than expected | Set a workflow-level TimeoutSeconds (e.g., 3600 seconds for a one-hour cap) so a runaway branch fails the execution rather than running indefinitely |
| Rekognition job stuck IN_PROGRESS | Polling loop never exits | Carry an attempt counter in the state payload and fail out of the Choice state after a maximum iteration count |
Add retry configuration to every Task state that makes an AWS API call:
{
"Retry": [
{
"ErrorEquals": ["Rekognition.ThrottlingException", "Rekognition.ProvisionedThroughputExceededException"],
"IntervalSeconds": 2,
"MaxAttempts": 3,
"BackoffRate": 2.0
},
{
"ErrorEquals": ["States.TaskFailed"],
"IntervalSeconds": 5,
"MaxAttempts": 2,
"BackoffRate": 2.0
}
]
}
For a broader treatment of error-handling patterns in Step Functions workflows, see Step Functions for Cart and Fulfillment: Async Workflow Patterns That Survive Production.
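The delays a Retry rule produces are deterministic: IntervalSeconds on the first attempt, multiplied by BackoffRate on each subsequent one. A quick sketch of the schedule for the throttling rule above:

```python
# Compute the wait before each retry attempt for a Step Functions Retry rule.

def retry_delays(interval_seconds, backoff_rate, max_attempts):
    return [interval_seconds * backoff_rate ** i for i in range(max_attempts)]

print(retry_delays(2, 2.0, 3))  # [2.0, 4.0, 8.0]
```

A throttled Start call therefore waits 2, 4, then 8 seconds before the error propagates, about 14 seconds of total backoff.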
Monitoring and Alerting
Step Functions provides execution-level visibility out of the box. Every state transition, every input/output payload, every error is recorded in the execution history. For pipeline-level monitoring, track these CloudWatch metrics:
| Metric | Source | Alert Threshold |
|---|---|---|
| ExecutionsFailed | Step Functions | Any failure (threshold: 1) |
| ExecutionsTimedOut | Step Functions | Any timeout (threshold: 1) |
| ExecutionThrottled | Step Functions | Sustained throttling (threshold: 5 in 5 minutes) |
| ExecutionTime | Step Functions | P99 execution time exceeds 2x expected duration |
| Errors (Lambda aggregation function) | Lambda | Any invocation error |
| Custom metric: videos processed | Lambda/CloudWatch | Drop below baseline daily volume |
Publish a custom CloudWatch metric from the aggregation Lambda that records whether content moderation flagged the video. Track the flag rate over time. A sudden spike in the flag rate may indicate a coordinated abuse campaign. A sudden drop may indicate a Rekognition model issue or a confidence threshold misconfiguration.
Scaling Limits and Throttling
The binding constraint on pipeline throughput is Rekognition Video's concurrent job limit: 20 concurrent jobs per account per region by default. Each Rekognition API call against a video counts as one job. A single pipeline execution that calls three Rekognition APIs consumes three concurrent job slots. With the default limit, you can run approximately six pipeline executions concurrently before hitting the ceiling.
| Resource | Default Limit | Adjustable | Request Method |
|---|---|---|---|
| Rekognition concurrent video jobs | 20 per account | Yes | Service Quotas console |
| Rekognition video size | 10 GB | No | Fixed |
| Rekognition video duration | 6 hours | No | Fixed |
| Transcribe concurrent batch jobs | 250 per account | Yes | Service Quotas console |
| Step Functions concurrent executions | Unlimited (Standard) | N/A | N/A |
| Step Functions StartExecution TPS | 2,000 | Yes | Service Quotas console |
Request a limit increase for Rekognition concurrent video jobs before going to production. For pipelines processing hundreds of videos per hour, a limit of 100-200 concurrent jobs is typical. AWS grants these increases readily since the limit exists primarily to prevent accidental runaway costs rather than to protect service capacity.
If you hit the concurrent job limit, Step Functions executions will fail at the StartContentModeration (or similar) Task state with a LimitExceededException. The retry configuration catches this and backs off, but sustained overload causes retries to exhaust and executions to fail. For high-throughput scenarios, add a queue (SQS) between EventBridge and Step Functions that controls the concurrency of pipeline executions. Use a Lambda function that reads from SQS and starts Step Functions executions at a controlled rate, respecting the Rekognition concurrent job limit.
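The slot arithmetic behind this constraint is simple enough to encode when sizing a quota request (the per-video API count here is this pipeline's three Rekognition calls):

```python
# How many pipeline executions fit under the Rekognition concurrent job limit,
# given each execution consumes one slot per Rekognition API it calls.

def max_concurrent_executions(job_limit, rekognition_apis_per_video=3):
    return job_limit // rekognition_apis_per_video

print(max_concurrent_executions(20))   # 6, at the default limit
print(max_concurrent_executions(150))  # 50, after a quota increase
```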
Bringing It All Together
```mermaid
graph TD
    Start([Start]) --> Validate[Validate Input<br/>Check size and format]
    Validate -->|Valid| Parallel
    Validate -->|Invalid| Reject[Reject:<br/>Write error to S3]
    subgraph Parallel [Parallel Analysis]
        direction TB
        CM_Start[Start Content<br/>Moderation] --> CM_Wait[Wait 20s]
        CM_Wait --> CM_Get[Get Content<br/>Moderation]
        CM_Get --> CM_Check{Job Status?}
        CM_Check -->|IN_PROGRESS| CM_Wait
        CM_Check -->|SUCCEEDED| CM_Done([Done])
        CM_Check -->|FAILED| CM_Fail([Fail])
        CR_Start[Start Celebrity<br/>Recognition] --> CR_Wait[Wait 20s]
        CR_Wait --> CR_Get[Get Celebrity<br/>Recognition]
        CR_Get --> CR_Check{Job Status?}
        CR_Check -->|IN_PROGRESS| CR_Wait
        CR_Check -->|SUCCEEDED| CR_Done([Done])
        CR_Check -->|FAILED| CR_Fail([Fail])
        LD_Start[Start Label<br/>Detection] --> LD_Wait[Wait 20s]
        LD_Wait --> LD_Get[Get Label<br/>Detection]
        LD_Get --> LD_Check{Job Status?}
        LD_Check -->|IN_PROGRESS| LD_Wait
        LD_Check -->|SUCCEEDED| LD_Done([Done])
        LD_Check -->|FAILED| LD_Fail([Fail])
        TX_Start[Start<br/>Transcription] --> TX_Wait[Wait 30s]
        TX_Wait --> TX_Get[Get Transcription<br/>Status]
        TX_Get --> TX_Check{Job Status?}
        TX_Check -->|IN_PROGRESS| TX_Wait
        TX_Check -->|COMPLETED| TX_Done([Done])
        TX_Check -->|FAILED| TX_Fail([Fail])
    end
    Parallel --> Aggregate[Lambda: Aggregate<br/>Results]
    Aggregate --> WriteS3[Write Metadata<br/>JSON to S3]
    WriteS3 --> End([End])
    Reject --> End
```

The complete pipeline, from S3 upload to metadata file, typically completes in 1-5 minutes for a 10-minute video. Rekognition Content Moderation is usually the longest-running branch because it performs frame-level analysis across the full video duration. Celebrity Recognition and Label Detection complete faster because they operate on sampled frames. Transcribe completion time scales roughly linearly with audio duration.
This architecture handles the common case (a video that passes moderation) and the flagged case (a video that contains policy-violating content) identically from an orchestration perspective. The pipeline always runs all analyses and always produces a complete metadata package. The decision about whether to publish, quarantine, or reject the video happens downstream, informed by the metadata. Separating analysis from decision-making keeps the pipeline simple and the business logic in a system designed for policy rules rather than ML orchestration.
The pipeline is deterministic, observable, and cost-transparent. Every execution produces the same metadata structure. Every failure is visible in the Step Functions console with full context. Every dollar of ML cost traces back to a specific video and a specific API call. For content moderation at platform scale, that operational clarity is worth more than any particular ML model's accuracy score.
Additional Resources
- Amazon Rekognition Video Documentation
- Amazon Rekognition Content Moderation
- Amazon Transcribe Developer Guide
- Step Functions SDK Service Integrations
- Step Functions Service Integration Patterns
- Amazon Rekognition Pricing
- EventBridge S3 Integration Tutorial
Let's Build Something!
I help teams ship cloud infrastructure that actually works at scale. Whether you're modernizing a legacy platform, designing a multi-region architecture from scratch, or figuring out how AI fits into your engineering workflow, I've seen your problem before. Let me help.
Currently taking on select consulting engagements through Vantalect.

