
Video Content Moderation with Step Functions and AWS AI Services

About the author: I'm Charles Sieg, a cloud architect and platform engineer who builds apps, services, and infrastructure for Fortune 1000 clients through Vantalect. If your organization is rethinking its software strategy in the age of AI-assisted engineering, let's talk.

Every platform that accepts user-uploaded video faces the same operational reality: a single piece of unmoderated content can produce legal liability, advertiser flight, and reputational damage that takes months to repair. I have built content moderation systems for platforms processing thousands of hours of video per day, and the architectural pattern I keep returning to is a Step Functions orchestration layer coordinating AWS managed AI services. Rekognition scans frames for nudity, violence, hate symbols, and other policy violations; it also identifies celebrities and labels objects and scenes. Transcribe pulls the audio track into a timestamped transcript. Step Functions ties these asynchronous, variable-duration jobs into a single deterministic pipeline that writes a structured metadata package back to S3 alongside the original video. This article is the architecture reference for that pipeline: the service integrations, the ASL definitions, the failure modes, the cost model, and the operational lessons that only surface under production load.

The Problem: Video at Scale

Platform-Scale Moderation Requirements

A platform accepting user-uploaded video needs to answer several questions about every piece of content before it goes live. Does the video contain nudity, violence, or other policy-violating material? Does it feature recognizable public figures whose likeness creates licensing or consent obligations? What objects and scenes appear in the video, and how should those feed into content classification and recommendation systems? What is being said in the audio track, and does the spoken content violate community guidelines?

Manual review cannot keep pace with upload volume. A human moderator watching video in real time processes one minute of content per minute of labor. At a $20/hour loaded cost, that is about $0.33 per minute of video. A platform receiving 10,000 hours of uploads per day accumulates roughly 3.65 million hours of content per year; at about 2,000 working hours per moderator, keeping pace requires more than 1,800 full-time moderators, at an annual loaded cost exceeding $70 million. Automated analysis running in parallel across multiple ML dimensions reduces that first-pass cost by two orders of magnitude and reserves human review for the ambiguous cases that actually require judgment.

Why Orchestration Matters

Each of these ML services (Rekognition Content Moderation, Rekognition Celebrity Recognition, Rekognition Label Detection, Amazon Transcribe) operates asynchronously. You submit a job, receive a job identifier, and poll or wait for completion. Job durations vary based on video length, resolution, and service load. A naive implementation scatters this coordination logic across Lambda functions, SQS queues, and DynamoDB status tables. Within six months, nobody can reason about the end-to-end flow, retry logic is inconsistent, and partial failures leave orphaned jobs consuming resources.

Step Functions eliminates this class of problems. The entire pipeline (trigger, fan-out, parallel execution, polling, aggregation, output) lives in a single ASL definition. Every execution is visible in the console. Retry policies are declarative. Error handling is consistent across all branches. For a deeper treatment of Step Functions internals, see AWS Step Functions: An Architecture Deep-Dive.

Pipeline Architecture Overview

Component Inventory

The pipeline uses six AWS services, each performing a specific role:

| Service | Role | Integration Pattern |
|---|---|---|
| Amazon S3 | Video storage; trigger source; metadata output destination | EventBridge notification on object creation |
| Amazon EventBridge | Event routing from S3 to Step Functions | Rule matching on S3 object-created events |
| AWS Step Functions | Workflow orchestration; parallel fan-out; polling; aggregation | Standard workflow (exactly-once, durable) |
| Amazon Rekognition Video | Content moderation, celebrity recognition, label detection | SDK integration (Request Response) with polling loop |
| Amazon Transcribe | Audio-to-text transcription | SDK integration (Request Response) with polling loop |
| AWS Lambda | Result aggregation and metadata assembly | Optimized integration from Step Functions |

No custom ML models. No container infrastructure. No GPU instances. Every ML capability is a managed API call billed per minute of video analyzed. The only compute you manage is a single Lambda function that assembles the final metadata package.

Data Flow

The pipeline follows a linear trigger-to-output path with a parallel fan-out in the middle.

graph LR
    A[S3 Bucket<br/>Video Upload] -->|Object Created| B[EventBridge<br/>Rule]
    B -->|Start Execution| C[Step Functions<br/>Workflow]
    C --> D[Parallel<br/>Analysis]
    D --> E1[Rekognition<br/>Content Moderation]
    D --> E2[Rekognition<br/>Celebrity Recognition]
    D --> E3[Rekognition<br/>Label Detection]
    D --> E4[Amazon<br/>Transcribe]
    E1 --> F[Lambda<br/>Aggregate Results]
    E2 --> F
    E3 --> F
    E4 --> F
    F --> G[S3 Bucket<br/>Metadata JSON]

Video content moderation pipeline architecture

A video lands in S3. EventBridge picks up the object-created notification and starts a Step Functions execution, passing the bucket name and object key as input. The workflow fans out into four parallel branches. Each branch starts an async analysis job, polls for completion, and retrieves the results. When all four branches complete, a Lambda function aggregates the results into a unified metadata JSON document and writes it to S3 alongside the original video.

Triggering the Pipeline

S3 Event Notifications to EventBridge

S3 can deliver event notifications through two mechanisms: legacy event notifications targeting SNS, SQS, or Lambda directly, and S3 Event Notifications delivered to EventBridge. For this pipeline, EventBridge is the correct choice. It supports content-based filtering (match on file extension, prefix, size), routes to Step Functions as a first-class target, and provides dead-letter queue support for failed deliveries.

Enable EventBridge notifications on the S3 bucket through the bucket properties. This is a one-time configuration that tells S3 to publish all events to EventBridge in addition to any legacy notification configuration. For a thorough comparison of AWS event routing services, see AWS Event-Driven Messaging: SNS, SQS, EventBridge, and Beyond.

EventBridge Rule Configuration

The EventBridge rule filters for video uploads and targets the Step Functions state machine:

{
  "source": ["aws.s3"],
  "detail-type": ["Object Created"],
  "detail": {
    "bucket": {
      "name": ["my-video-bucket"]
    },
    "object": {
      "key": [{
        "suffix": ".mp4"
      }]
    }
  }
}

Add additional suffix patterns for other video formats (.mov, .avi, .mkv). The rule target is the Step Functions state machine ARN with an IAM role that grants states:StartExecution. EventBridge passes the full event detail as the execution input.

Input Transformation

The raw EventBridge event contains metadata you do not need in the workflow. Use an input transformer on the EventBridge target to extract only the bucket and key:

{
  "InputPathsMap": {
    "bucket": "$.detail.bucket.name",
    "key": "$.detail.object.key",
    "size": "$.detail.object.size"
  },
  "InputTemplate": "{\"bucket\": \"<bucket>\", \"key\": \"<key>\", \"size\": <size>}"
}

The Step Functions execution now receives a clean input object with three fields. Every downstream state can reference $.bucket and $.key without navigating nested event structures.
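A quick way to sanity-check the transformer mapping is to simulate it in plain Python against a minimal sample event. This is an illustrative sketch; the bucket, key, and size values are invented, and the real transformation is performed by EventBridge itself.

```python
def transform_s3_event(event: dict) -> dict:
    """Mirror the InputPathsMap/InputTemplate configuration above:
    extract only bucket, key, and size from the raw S3 event."""
    detail = event["detail"]
    return {
        "bucket": detail["bucket"]["name"],
        "key": detail["object"]["key"],
        "size": detail["object"]["size"],
    }

# Minimal sample event (illustrative values only).
sample_event = {
    "source": "aws.s3",
    "detail-type": "Object Created",
    "detail": {
        "bucket": {"name": "my-video-bucket"},
        "object": {"key": "uploads/video.mp4", "size": 104857600},
    },
}

print(transform_s3_event(sample_event))
# {'bucket': 'my-video-bucket', 'key': 'uploads/video.mp4', 'size': 104857600}
```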

The Step Functions Workflow

Workflow Design

This pipeline requires a Standard workflow. Express workflows do not support the .waitForTaskToken pattern or execution durations longer than five minutes, and video analysis jobs routinely run for several minutes on longer content. Standard workflows provide exactly-once execution semantics, durable state checkpointing, and 90-day execution history retention. The cost difference is negligible at the scale of a per-video invocation (typically $0.025 per thousand state transitions).

For a detailed comparison of Standard vs. Express workflows and guidance on choosing between them, see AWS Step Functions: An Architecture Deep-Dive.

The top-level workflow structure is a Parallel state containing four branches, followed by a Task state that invokes the aggregation Lambda:

{
  "Comment": "Video content moderation pipeline",
  "StartAt": "AnalyzeVideo",
  "States": {
    "AnalyzeVideo": {
      "Type": "Parallel",
      "Branches": [
        { "StartAt": "StartContentModeration", "States": { "..." : {} } },
        { "StartAt": "StartCelebrityRecognition", "States": { "..." : {} } },
        { "StartAt": "StartLabelDetection", "States": { "..." : {} } },
        { "StartAt": "StartTranscription", "States": { "..." : {} } }
      ],
      "ResultPath": "$.analysisResults",
      "Next": "AggregateResults"
    },
    "AggregateResults": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:AggregateVideoMetadata",
      "Next": "Done"
    },
    "Done": {
      "Type": "Succeed"
    }
  }
}

The Parallel state runs all four branches concurrently. Each branch is an independent sub-state-machine with its own Start, Get, and Wait states. The Parallel state collects the output of all branches into an array at $.analysisResults and passes it to the aggregation function.

Parallel Analysis Branches

Each Rekognition branch follows the same three-step pattern:

  1. Start the async job (Request Response integration, returns immediately with a JobId)
  2. Wait a fixed interval (10-30 seconds, depending on expected video length)
  3. Get the job status; if still IN_PROGRESS, loop back to Wait; if SUCCEEDED, pass results forward

This polling pattern is necessary because Rekognition Video does not have an optimized .sync integration with Step Functions. The SDK integration is Request Response only. You call StartContentModeration, get back a JobId, and then poll GetContentModeration until the JobStatus field returns SUCCEEDED or FAILED.
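The per-branch control flow can be modeled as a generic polling helper. This is an illustrative Python sketch of the loop's semantics only; in the actual pipeline the Wait, Get, and Choice states express the same cycle declaratively.

```python
import time

def poll_until_done(get_status, interval_s=20, max_attempts=180):
    """Model of the Wait -> Get -> Choice polling cycle for the
    Rekognition branches: call get_status, loop while the job reports
    IN_PROGRESS, stop on a terminal state. max_attempts bounds the loop
    the same way TimeoutSeconds bounds the state machine."""
    for _ in range(max_attempts):
        status = get_status()
        if status in ("SUCCEEDED", "FAILED"):
            return status
        time.sleep(interval_s)
    raise TimeoutError("analysis job did not reach a terminal state")
```

Injecting `get_status` as a callable keeps the loop testable without AWS; in production it would wrap the relevant boto3 `Get*` call and return the `JobStatus` field.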

Note
Rekognition Video also supports an SNS notification channel where it publishes a completion message when a job finishes. You could use this with a `.waitForTaskToken` pattern: start the job, pause the workflow with a task token, have an SNS-triggered Lambda call `SendTaskSuccess` when the notification arrives. This avoids polling entirely. However, the polling pattern is simpler to implement, easier to debug, and the Wait state costs nothing. I use polling for pipelines processing fewer than 1,000 videos per day and switch to the callback pattern only when polling frequency creates meaningful Rekognition API costs at scale.

Handling Async Rekognition Jobs

Here is the complete ASL for the Content Moderation branch, which illustrates the polling pattern all three Rekognition branches share:

{
  "StartContentModeration": {
    "Type": "Task",
    "Resource": "arn:aws:states:::aws-sdk:rekognition:startContentModeration",
    "Parameters": {
      "Video": {
        "S3Object": {
          "Bucket.$": "$.bucket",
          "Name.$": "$.key"
        }
      },
      "MinConfidence": 60,
      "NotificationChannel": {
        "SNSTopicArn": "arn:aws:sns:us-east-1:123456789012:RekognitionTopic",
        "RoleArn": "arn:aws:iam::123456789012:role/RekognitionSNSRole"
      }
    },
    "ResultPath": "$.moderationJob",
    "Next": "WaitForModeration"
  },
  "WaitForModeration": {
    "Type": "Wait",
    "Seconds": 20,
    "Next": "GetContentModeration"
  },
  "GetContentModeration": {
    "Type": "Task",
    "Resource": "arn:aws:states:::aws-sdk:rekognition:getContentModeration",
    "Parameters": {
      "JobId.$": "$.moderationJob.JobId",
      "MaxResults": 1000
    },
    "ResultPath": "$.moderationResults",
    "Next": "CheckModerationStatus"
  },
  "CheckModerationStatus": {
    "Type": "Choice",
    "Choices": [
      {
        "Variable": "$.moderationResults.JobStatus",
        "StringEquals": "SUCCEEDED",
        "Next": "ModerationComplete"
      },
      {
        "Variable": "$.moderationResults.JobStatus",
        "StringEquals": "FAILED",
        "Next": "ModerationFailed"
      }
    ],
    "Default": "WaitForModeration"
  },
  "ModerationComplete": {
    "Type": "Pass",
    "End": true
  },
  "ModerationFailed": {
    "Type": "Fail",
    "Error": "ModerationJobFailed",
    "Cause": "Rekognition content moderation job failed"
  }
}

The MinConfidence parameter of 60 returns moderation labels with at least 60% confidence. Setting this lower increases recall (catches more potentially problematic content) at the cost of more false positives. Setting it higher reduces noise but risks missing borderline content. I default to 60 for first-pass automated screening and route anything flagged to human review.
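Downstream code can apply a second, stricter threshold on top of whatever Rekognition returned. A sketch, using the `ModerationLabels` response shape from GetContentModeration (the sample detections are invented):

```python
def filter_moderation_labels(labels, min_confidence=60.0):
    """Keep only moderation detections at or above the threshold.
    `labels` follows the GetContentModeration ModerationLabels shape."""
    return [
        entry for entry in labels
        if entry["ModerationLabel"]["Confidence"] >= min_confidence
    ]

detections = [
    {"Timestamp": 45200, "ModerationLabel": {"Name": "Violence", "Confidence": 87.3}},
    {"Timestamp": 52000, "ModerationLabel": {"Name": "Alcohol", "Confidence": 41.9}},
]
print(filter_moderation_labels(detections))
# only the Violence entry survives a 60-confidence cut
```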

Rekognition Analysis Branches

Content Moderation

Rekognition Content Moderation analyzes video frames for inappropriate or offensive visual content. The service uses a three-level hierarchical taxonomy:

| Level 1 Category | Example Level 2 Labels | Detection Scope |
|---|---|---|
| Explicit Nudity | Nudity, Graphic Male Nudity, Graphic Female Nudity, Sexual Activity | Full and partial nudity, sexual content |
| Non-Explicit Nudity | Revealing Clothes, Male Swimwear, Female Swimwear | Suggestive but not explicit content |
| Violence | Graphic Violence, Physical Violence, Weapon Violence, Self Injury | Combat, weapons, blood, injury |
| Visually Disturbing | Emaciated Bodies, Corpses, Hanging, Air Crash | Graphic injury, death, disaster |
| Drugs & Tobacco | Drug Use, Drug Paraphernalia, Tobacco Products | Substance use and paraphernalia |
| Alcohol | Drinking, Alcoholic Beverages, Beer, Wine | Alcohol consumption and products |
| Gambling | Gambling, Casino | Gaming and betting |
| Hate Symbols | Nazi Party, White Supremacy, Extremist | Symbols associated with hate groups |
| Rude Gestures | Middle Finger | Offensive hand gestures |

The response includes a ModerationLabels array where each entry contains the label name, its parent label (hierarchy position), the confidence score, and the timestamp in milliseconds where the content was detected. This timestamp precision lets you build frame-accurate moderation: flag specific segments of a video rather than rejecting the entire file.
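One way to turn per-detection timestamps into reviewable segments is to merge detections that fall close together. This is a hypothetical helper; the 5-second merge gap is an arbitrary choice, not anything the API prescribes.

```python
def flagged_segments(labels, gap_ms=5000):
    """Collapse per-frame moderation timestamps into contiguous flagged
    (start_ms, end_ms) ranges: detections closer than gap_ms merge.
    Input follows the GetContentModeration ModerationLabels shape."""
    times = sorted(entry["Timestamp"] for entry in labels)
    segments = []
    for t in times:
        if segments and t - segments[-1][1] <= gap_ms:
            segments[-1][1] = t  # extend the current segment
        else:
            segments.append([t, t])  # open a new segment
    return [tuple(s) for s in segments]
```

A human reviewer then jumps straight to each flagged range instead of scrubbing the whole file.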

Note
Content Moderation pricing is $0.10 per minute of video analyzed. A 10-minute video costs $1.00 for moderation analysis alone. Plan your cost model around this per-minute rate multiplied by the number of Rekognition APIs you invoke per video.

Celebrity Recognition

Rekognition Celebrity Recognition identifies tens of thousands of public figures across entertainment, sports, politics, business, and media. The StartCelebrityRecognition / GetCelebrityRecognition API pair follows the same async pattern as Content Moderation.

The response includes, for each detected celebrity: name, unique Rekognition ID, confidence score, bounding box coordinates, face landmarks, known gender, and an array of URLs pointing to external reference information (typically IMDb or Wikipedia). The timestamp field indicates when the celebrity appears in the video.

The ASL for this branch is structurally identical to the Content Moderation branch. Replace startContentModeration with startCelebrityRecognition and getContentModeration with getCelebrityRecognition in the resource ARNs. The parameter block for the Start call is simpler since Celebrity Recognition takes no confidence threshold or additional configuration beyond the video source.

Celebrity detection serves two purposes in a moderation pipeline. First, it identifies individuals whose appearance in user-generated content may have legal implications (right of publicity, defamation risk). Second, it feeds content classification: a video featuring a specific athlete is likely sports content; a video featuring a politician is likely news or political commentary. This metadata improves downstream categorization and recommendation quality.

Label Detection

Rekognition Label Detection identifies objects, scenes, activities, and concepts in video frames. The service returns labels such as "Car," "Beach," "Running," "Crowd," "Dog," and thousands of others, each with a confidence score and a bounding box (for objects) or timestamp (for activities and scenes).

The StartLabelDetection API accepts optional parameters:

| Parameter | Default | Description |
|---|---|---|
| MinConfidence | 50 | Minimum confidence threshold for returned labels |
| Features | None | Set to GENERAL_LABELS for standard detection |
| Settings.GeneralLabels.LabelInclusionFilters | All | Array of specific labels to detect (ignores all others) |
| Settings.GeneralLabels.LabelExclusionFilters | None | Array of labels to exclude from results |

Label Detection is the broadest of the three Rekognition APIs. Content Moderation and Celebrity Recognition have narrow outputs (flags and names). Label Detection produces a dense metadata stream: every object in every analyzed frame generates a result entry. For a 10-minute video, expect thousands of label entries. The MaxResults parameter on the Get call controls pagination (maximum 1000 per page), and you should implement NextToken-based pagination in the polling step to capture the full result set.

Bringing Results Together

When all three Rekognition branches and the Transcribe branch complete, the Parallel state collects their outputs into a four-element array. The array ordering matches the branch ordering in the ASL definition. Branch 0 contains content moderation results, Branch 1 contains celebrity results, Branch 2 contains label results, and Branch 3 contains the transcription output.

The aggregation Lambda receives this array, restructures it into a unified metadata document, and writes the document to S3:

{
  "videoKey": "uploads/2026/02/25/user-video-abc123.mp4",
  "analyzedAt": "2026-02-25T14:30:00Z",
  "moderation": {
    "flagged": true,
    "labels": [
      {
        "label": "Violence",
        "confidence": 87.3,
        "timestampMs": 45200
      }
    ]
  },
  "celebrities": [
    {
      "name": "John Smith",
      "confidence": 99.1,
      "timestampMs": 12000,
      "urls": ["https://www.imdb.com/name/nm0000001/"]
    }
  ],
  "labels": {
    "summary": ["Car", "Road", "Person", "Building"],
    "detailed": [ "..." ]
  },
  "transcript": {
    "status": "COMPLETED",
    "outputUri": "s3://my-video-bucket/transcripts/user-video-abc123.json"
  },
  "pipelineExecutionArn": "arn:aws:states:us-east-1:123456789012:execution:VideoModerationPipeline:abc-123"
}

The metadata file is written to a predictable path derived from the original video key: replace the file extension with .metadata.json, or write to a parallel prefix (e.g., metadata/ instead of uploads/). Downstream consumers (moderation dashboards, recommendation engines, search indexers) read the metadata file without needing to invoke any ML services themselves.
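The key derivation for the extension-replacement convention is a one-liner. A sketch, assuming POSIX-style S3 keys:

```python
import posixpath

def metadata_key(video_key: str) -> str:
    """Derive the metadata object key from the video key: strip the
    video extension, append .metadata.json, keep the original prefix."""
    stem, _ext = posixpath.splitext(video_key)
    return stem + ".metadata.json"

print(metadata_key("uploads/2026/02/25/user-video-abc123.mp4"))
# uploads/2026/02/25/user-video-abc123.metadata.json
```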

Transcription with Amazon Transcribe

Starting a Transcription Job

Amazon Transcribe extracts spoken audio from video files and produces a timestamped transcript. The StartTranscriptionJob API accepts video files directly from S3; no pre-processing or audio extraction step is required. Transcribe handles the audio stream extraction internally.

The Step Functions SDK integration for Transcribe follows the same Request Response pattern as Rekognition:

{
  "StartTranscription": {
    "Type": "Task",
    "Resource": "arn:aws:states:::aws-sdk:transcribe:startTranscriptionJob",
    "Parameters": {
      "TranscriptionJobName.$": "States.Format('moderation-{}', $$.Execution.Name)",
      "LanguageCode": "en-US",
      "Media": {
        "MediaFileUri.$": "States.Format('s3://{}/{}', $.bucket, $.key)"
      },
      "OutputBucketName.$": "$.bucket",
      "OutputKey.$": "States.Format('transcripts/{}.json', $.key)",
      "Settings": {
        "ShowSpeakerLabels": true,
        "MaxSpeakerLabels": 10
      }
    },
    "ResultPath": "$.transcriptionJob",
    "Next": "WaitForTranscription"
  }
}

The States.Format intrinsic function constructs the S3 URI and output key dynamically from the execution input. ShowSpeakerLabels enables speaker diarization, which identifies and labels different speakers in the audio. This is valuable for moderation because it lets you attribute specific statements to specific speakers in a conversation.
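One caveat: TranscriptionJobName accepts only the characters 0-9, a-z, A-Z, period, underscore, and hyphen, so an S3 key containing slashes cannot be embedded in the job name verbatim, and reusing a name raises ConflictException. A hypothetical sanitizer, run for example in a small pre-processing Lambda if you want key-derived names:

```python
import re

def transcription_job_name(key: str, execution_id: str) -> str:
    """Build a valid TranscriptionJobName from an S3 key: replace any
    character outside [0-9a-zA-Z._-] (slashes in particular), append a
    unique suffix to avoid collisions, and respect the length limit."""
    safe = re.sub(r"[^0-9a-zA-Z._-]", "-", key)
    return f"moderation-{safe}-{execution_id}"[:200]
```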

Polling for Completion

Transcribe jobs, like Rekognition jobs, run asynchronously. The polling pattern is the same: Wait, Get status, Check, loop or proceed.

{
  "WaitForTranscription": {
    "Type": "Wait",
    "Seconds": 30,
    "Next": "GetTranscriptionStatus"
  },
  "GetTranscriptionStatus": {
    "Type": "Task",
    "Resource": "arn:aws:states:::aws-sdk:transcribe:getTranscriptionJob",
    "Parameters": {
      "TranscriptionJobName.$": "$.transcriptionJob.TranscriptionJob.TranscriptionJobName"
    },
    "ResultPath": "$.transcriptionStatus",
    "Next": "CheckTranscriptionStatus"
  },
  "CheckTranscriptionStatus": {
    "Type": "Choice",
    "Choices": [
      {
        "Variable": "$.transcriptionStatus.TranscriptionJob.TranscriptionJobStatus",
        "StringEquals": "COMPLETED",
        "Next": "TranscriptionComplete"
      },
      {
        "Variable": "$.transcriptionStatus.TranscriptionJob.TranscriptionJobStatus",
        "StringEquals": "FAILED",
        "Next": "TranscriptionFailed"
      }
    ],
    "Default": "WaitForTranscription"
  }
}

Transcription jobs typically complete in 20-50% of the video's duration. A 10-minute video usually finishes transcription within 2-5 minutes. The 30-second polling interval balances responsiveness against API call volume. For very long videos (over one hour), increase the polling interval to 60 seconds to reduce unnecessary GetTranscriptionJob calls.

Retrieving and Storing the Transcript

Unlike Rekognition, Transcribe writes its output directly to S3 at the location specified in OutputBucketName and OutputKey. The transcription branch does not need to retrieve and pass results through the Step Functions execution. It confirms completion and passes the output location forward. The aggregation Lambda reads the transcript from S3 if it needs to extract specific content for the metadata package, or it simply records the transcript location in the metadata document.

Transcribe output includes word-level timestamps, confidence scores per word, punctuation, and speaker labels. This granularity enables transcript-based moderation: flag specific time ranges where profanity, hate speech, or other policy-violating language appears. Combine this with Rekognition Content Moderation timestamps for a complete picture of which segments of the video require human review.

Pricing and Cost Modeling

Per-Service Pricing

Every service in this pipeline bills per unit of content processed. No provisioned capacity, no hourly instance costs, no idle charges.

| Service | Unit | Price | Free Tier |
|---|---|---|---|
| Rekognition Video (all APIs) | Per minute of video | $0.10/min per API | 60 min/month for 12 months |
| Amazon Transcribe (batch) | Per second of audio (15s minimum) | $0.024/min ($0.0004/sec) | 60 min/month for 12 months |
| Step Functions (Standard) | Per state transition | $0.000025/transition | 4,000 transitions/month |
| Lambda | Per request + duration | $0.20/1M requests + $0.0000166667/GB-sec | 1M requests + 400K GB-sec/month |
| EventBridge | Per custom event published | $1.00/1M custom events | AWS service events (including S3) are free |
| S3 (storage + requests) | Per GB stored + per request | $0.023/GB-month + $0.005/1K PUT | 5 GB for 12 months |

The dominant cost is Rekognition Video at $0.10 per minute per API. Running three Rekognition APIs (Content Moderation, Celebrity Recognition, Label Detection) against the same video means $0.30 per minute of video. Transcribe adds $0.024 per minute. Step Functions, Lambda, EventBridge, and S3 request costs are negligible at typical volumes.

Cost Model for a Typical Pipeline Run

| Video Duration | Rekognition (3 APIs) | Transcribe | Step Functions (~40 transitions) | Total per Video |
|---|---|---|---|---|
| 1 minute | $0.30 | $0.024 | $0.001 | $0.33 |
| 5 minutes | $1.50 | $0.12 | $0.001 | $1.62 |
| 10 minutes | $3.00 | $0.24 | $0.001 | $3.24 |
| 30 minutes | $9.00 | $0.72 | $0.001 | $9.72 |
| 60 minutes | $18.00 | $1.44 | $0.001 | $19.44 |

At scale, cost accumulates quickly. Processing 1,000 ten-minute videos per day costs approximately $3,240 per day ($97,200 per month) in Rekognition and Transcribe charges alone. For strategies on managing S3 storage costs as metadata and transcripts accumulate, see AWS S3 Cost Optimization: The Complete Savings Playbook.
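The cost table reduces to a small formula. The sketch below encodes the per-unit rates discussed above; it is a planning aid, not a billing source of truth.

```python
# Per-unit rates from the pricing discussion above.
REKOGNITION_PER_MIN = 0.10            # per API, per minute of video
TRANSCRIBE_PER_MIN = 0.024            # batch transcription
SFN_PER_TRANSITION = 0.000025         # Standard workflow

def per_video_cost(minutes: float, rekognition_apis: int = 3,
                   transitions: int = 40) -> float:
    """Estimate pipeline cost per video. Rekognition dominates;
    Transcribe is small by comparison; Step Functions is noise."""
    return (minutes * rekognition_apis * REKOGNITION_PER_MIN
            + minutes * TRANSCRIBE_PER_MIN
            + transitions * SFN_PER_TRANSITION)

print(f"10-minute video: ${per_video_cost(10):.2f}")  # $3.24
```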

Optimization Strategies

Several techniques reduce per-video cost without sacrificing coverage:

Selective API invocation. Not every video needs all four analysis types. Add a classification step at the beginning of the workflow that checks video metadata (uploader trust level, content category, file size) and routes to a reduced set of APIs. Trusted uploaders with established track records might skip Celebrity Recognition. Short clips under 15 seconds might skip Transcribe.

Frame sampling for Rekognition. Rekognition Video analyzes frames at a sampling rate it determines internally. For Label Detection, where you need broad categorization rather than frame-accurate tracking, consider extracting keyframes with a lightweight Lambda + FFmpeg step and running Rekognition Image APIs ($0.001 per image for Label Detection) instead of the Video API. For a 10-minute video with one keyframe per second, that is 600 images at $0.60 total vs. $1.00 for the Video API. This approach reduces cost by 40% for Label Detection specifically.

Confidence threshold tuning. Higher MinConfidence values reduce the volume of results stored and processed downstream. For Content Moderation, a threshold of 80 dramatically reduces false positives at the cost of a small increase in missed detections. Tune this based on your platform's risk tolerance and human review capacity.

Transcribe language detection. If your platform handles multilingual content, enable automatic language detection in Transcribe rather than running separate jobs for each possible language. This avoids wasted Transcribe minutes on incorrect language guesses.

Operational Considerations

Failure Modes

Production pipelines fail. The question is whether failures are observable, recoverable, and contained. Here are the failure modes I have encountered in production:

| Failure Mode | Symptom | Mitigation |
|---|---|---|
| Rekognition throttling | ProvisionedThroughputExceededException on Start calls | Exponential backoff retry in ASL (3 retries, 2x backoff, 2s initial interval) |
| Video format unsupported | Rekognition returns InvalidParameterException | Validate format (H.264 in MP4/MOV) before starting analysis; add a format-check state |
| Video exceeds 10 GB | Start call rejected | Check $.size from the EventBridge input and route oversized files to a separate processing path or rejection state |
| Transcribe job name collision | ConflictException on StartTranscriptionJob | Append a unique suffix (execution ID or UUID) to the job name |
| S3 write failure on metadata | Lambda error on PutObject | Retry the aggregation Lambda with a Catch block; ensure IAM permissions include s3:PutObject |
| Parallel branch timeout | One branch runs far longer than expected | Set TimeoutSeconds on each branch's polling loop (e.g., 3600 seconds for a one-hour max) |
| Rekognition job stuck IN_PROGRESS | Polling loop never exits | Combine TimeoutSeconds on the polling states with a maximum iteration count in a Choice state |

Add retry configuration to every Task state that makes an AWS API call:

{
  "Retry": [
    {
      "ErrorEquals": ["Rekognition.ThrottlingException", "Rekognition.ProvisionedThroughputExceededException"],
      "IntervalSeconds": 2,
      "MaxAttempts": 3,
      "BackoffRate": 2.0
    },
    {
      "ErrorEquals": ["States.TaskFailed"],
      "IntervalSeconds": 5,
      "MaxAttempts": 2,
      "BackoffRate": 2.0
    }
  ]
}
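Expanded into concrete numbers, the throttling policy above waits 2s, then 4s, then 8s before successive attempts. A plain-Python illustration of that ASL retry semantics:

```python
def backoff_schedule(interval_s: float, max_attempts: int,
                     backoff_rate: float) -> list:
    """Expand an ASL Retry block into the wait applied before each retry
    attempt, useful for sanity-checking worst-case added latency."""
    return [interval_s * backoff_rate ** i for i in range(max_attempts)]

print(backoff_schedule(2, 3, 2.0))  # [2.0, 4.0, 8.0]
```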

For a broader treatment of error-handling patterns in Step Functions workflows, see Step Functions for Cart and Fulfillment: Async Workflow Patterns That Survive Production.

Monitoring and Alerting

Step Functions provides execution-level visibility out of the box. Every state transition, every input/output payload, every error is recorded in the execution history. For pipeline-level monitoring, track these CloudWatch metrics:

| Metric | Source | Alert Threshold |
|---|---|---|
| ExecutionsFailed | Step Functions | Any failure (threshold: 1) |
| ExecutionsTimedOut | Step Functions | Any timeout (threshold: 1) |
| ExecutionThrottled | Step Functions | Sustained throttling (threshold: 5 in 5 minutes) |
| ExecutionTime | Step Functions | P99 execution time exceeds 2x expected duration |
| Errors (Lambda aggregation function) | Lambda | Any invocation error |
| Custom metric: videos processed | Lambda/CloudWatch | Drop below baseline daily volume |

Publish a custom CloudWatch metric from the aggregation Lambda that records whether content moderation flagged the video. Track the flag rate over time. A sudden spike in the flag rate may indicate a coordinated abuse campaign. A sudden drop may indicate a Rekognition model issue or a confidence threshold misconfiguration.

Scaling Limits and Throttling

The binding constraint on pipeline throughput is Rekognition Video's concurrent job limit: 20 concurrent jobs per account per region by default. Each Rekognition API call against a video counts as one job. A single pipeline execution that calls three Rekognition APIs consumes three concurrent job slots. With the default limit, you can run approximately six pipeline executions concurrently before hitting the ceiling.

| Resource | Default Limit | Adjustable | Request Method |
|---|---|---|---|
| Rekognition concurrent video jobs | 20 per account | Yes | Service Quotas console |
| Rekognition video size | 10 GB | No | Fixed |
| Rekognition video duration | 6 hours | No | Fixed |
| Transcribe concurrent batch jobs | 250 per account | Yes | Service Quotas console |
| Step Functions concurrent executions | Unlimited (Standard) | N/A | N/A |
| Step Functions StartExecution TPS | 2,000 | Yes | Service Quotas console |

Request a limit increase for Rekognition concurrent video jobs before going to production. For pipelines processing hundreds of videos per hour, a limit of 100-200 concurrent jobs is typical. AWS grants these increases readily since the limit exists primarily to prevent accidental runaway costs rather than to protect service capacity.

If you hit the concurrent job limit, Step Functions executions will fail at the StartContentModeration (or similar) Task state with a LimitExceededException. The retry configuration catches this and backs off, but sustained overload causes retries to exhaust and executions to fail. For high-throughput scenarios, add a queue (SQS) between EventBridge and Step Functions that controls the concurrency of pipeline executions. Use a Lambda function that reads from SQS and starts Step Functions executions at a controlled rate, respecting the Rekognition concurrent job limit.
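The arithmetic behind that ceiling is worth encoding wherever you size the SQS consumer's concurrency. A sketch, using the default quota values discussed above:

```python
def max_concurrent_executions(concurrent_job_limit: int,
                              rekognition_apis_per_video: int) -> int:
    """How many pipeline executions can run at once before Rekognition's
    concurrent-job quota starts returning LimitExceededException. Each
    execution holds one job slot per Rekognition API it invokes."""
    return concurrent_job_limit // rekognition_apis_per_video

print(max_concurrent_executions(20, 3))   # 6 with the default quota
print(max_concurrent_executions(200, 3))  # 66 after a quota increase
```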

Bringing It All Together

graph TD
    Start([Start]) --> Validate[Validate Input<br/>Check size and format]
    Validate -->|Valid| Parallel
    Validate -->|Invalid| Reject[Reject:<br/>Write error to S3]
    subgraph Parallel [Parallel Analysis]
        direction TB
        CM_Start[Start Content<br/>Moderation] --> CM_Wait[Wait 20s]
        CM_Wait --> CM_Get[Get Content<br/>Moderation]
        CM_Get --> CM_Check{Job Status?}
        CM_Check -->|IN_PROGRESS| CM_Wait
        CM_Check -->|SUCCEEDED| CM_Done([Done])
        CM_Check -->|FAILED| CM_Fail([Fail])
        CR_Start[Start Celebrity<br/>Recognition] --> CR_Wait[Wait 20s]
        CR_Wait --> CR_Get[Get Celebrity<br/>Recognition]
        CR_Get --> CR_Check{Job Status?}
        CR_Check -->|IN_PROGRESS| CR_Wait
        CR_Check -->|SUCCEEDED| CR_Done([Done])
        CR_Check -->|FAILED| CR_Fail([Fail])
        LD_Start[Start Label<br/>Detection] --> LD_Wait[Wait 20s]
        LD_Wait --> LD_Get[Get Label<br/>Detection]
        LD_Get --> LD_Check{Job Status?}
        LD_Check -->|IN_PROGRESS| LD_Wait
        LD_Check -->|SUCCEEDED| LD_Done([Done])
        LD_Check -->|FAILED| LD_Fail([Fail])
        TX_Start[Start<br/>Transcription] --> TX_Wait[Wait 30s]
        TX_Wait --> TX_Get[Get Transcription<br/>Status]
        TX_Get --> TX_Check{Job Status?}
        TX_Check -->|IN_PROGRESS| TX_Wait
        TX_Check -->|COMPLETED| TX_Done([Done])
        TX_Check -->|FAILED| TX_Fail([Fail])
    end
    Parallel --> Aggregate[Lambda: Aggregate<br/>Results]
    Aggregate --> WriteS3[Write Metadata<br/>JSON to S3]
    WriteS3 --> End([End])
    Reject --> End

Complete Step Functions workflow for video content moderation

The complete pipeline, from S3 upload to metadata file, typically completes in 1-5 minutes for a 10-minute video. Rekognition Content Moderation is usually the longest-running branch because it performs frame-level analysis across the full video duration. Celebrity Recognition and Label Detection complete faster because they operate on sampled frames. Transcribe completion time scales roughly linearly with audio duration.

This architecture handles the common case (a video that passes moderation) and the flagged case (a video that contains policy-violating content) identically from an orchestration perspective. The pipeline always runs all analyses and always produces a complete metadata package. The decision about whether to publish, quarantine, or reject the video happens downstream, informed by the metadata. Separating analysis from decision-making keeps the pipeline simple and the business logic in a system designed for policy rules rather than ML orchestration.

The pipeline is deterministic, observable, and cost-transparent. Every execution produces the same metadata structure. Every failure is visible in the Step Functions console with full context. Every dollar of ML cost traces back to a specific video and a specific API call. For content moderation at platform scale, that operational clarity is worth more than any particular ML model's accuracy score.


Let's Build Something!

I help teams ship cloud infrastructure that actually works at scale. Whether you're modernizing a legacy platform, designing a multi-region architecture from scratch, or figuring out how AI fits into your engineering workflow, I've seen your problem before. Let me help.

Currently taking on select consulting engagements through Vantalect.