
iOS Telemetry Pipeline with Kinesis, Glue, and Athena

About the author: I'm Charles Sieg, a cloud architect and platform engineer who builds apps, services, and infrastructure for Fortune 1000 clients through Vantalect. If your organization is rethinking its software strategy in the age of AI-assisted engineering, let's talk.

Any iOS app with real users generates telemetry. Session starts, feature usage, error events, performance metrics, purchase funnels. Most teams start by shipping all of it to Amplitude or Mixpanel and calling it done. That works for a while. Then the monthly invoice triples, you discover the vendor's data model cannot answer a question your PM asked three days ago, and you realize you are paying somebody else to store your data in a format optimized for their business.

I have deployed the pipeline documented here across several production iOS applications. Pinpoint handles ingestion, Kinesis Data Firehose delivers records reliably to S3, Glue discovers the schema automatically, and Athena gives you full SQL over the raw data. No servers. Scales from zero to billions of events per month. I provide the full infrastructure code in both Terraform and Pulumi so you can pick whichever fits your team.

Why a Dedicated Telemetry Pipeline

The first pushback I always hear: "Amplitude already does this." Sure. It does, at a price, with their query model, and with your data locked in their infrastructure. Once you evaluate cost, flexibility, and data ownership together, the case for owning the pipeline gets hard to argue against.

| Dimension | Third-Party Platform | AWS Telemetry Pipeline |
|---|---|---|
| Data ownership | Vendor stores your data; export often limited or delayed | You own the S3 bucket and query anytime with any tool |
| Query flexibility | Constrained to vendor's UI and query model | Full SQL via Athena; join with any other dataset |
| Data residency | Limited region selection; dependent on vendor's infrastructure | Deploy in any AWS region; meets sovereignty requirements |
| Retention | Often tiered pricing for longer retention | S3 lifecycle policies; pennies per GB-month |
| Integration | API/webhook exports; often batched or rate-limited | Direct S3 access; feed into ML pipelines, data lakes, BI tools |
| Schema control | Vendor's event schema with limited customization | Your schema, your structure, your partitioning |

The numbers at scale speak for themselves:

| Monthly Event Volume | Amplitude (Growth) | Mixpanel (Growth) | AWS Pipeline (estimated) |
|---|---|---|---|
| 10M events | ~$1,000/mo | ~$1,100/mo | ~$25/mo |
| 100M events | ~$4,500/mo | ~$5,000/mo | ~$180/mo |
| 1B events | ~$20,000+/mo (Enterprise) | ~$25,000+/mo (Enterprise) | ~$1,400/mo |

Those AWS estimates include Pinpoint, Firehose, S3, Glue crawler runs, and moderate Athena query volume. At a billion events per month, this pipeline costs roughly 15x less. The gap only widens with longer retention windows because S3 storage is pennies compared to what analytics vendors charge for historical data access.

Pipeline Architecture

Standard streaming ingestion pattern. Each component does one thing:

```mermaid
flowchart LR
  A[iOS App<br/>AWS Amplify SDK] --> B[Amazon Pinpoint]
  B --> C[Kinesis Data Firehose]
  C --> D[Amazon S3<br/>Partitioned Storage]
  D --> E[AWS Glue<br/>Schema Crawler]
  E --> F[Glue Data Catalog]
  F --> G[Amazon Athena]
  D --> G
```

iOS telemetry pipeline architecture

I chose each component for a specific reason:

| Component | Role | Scaling Model | Cost Driver |
|---|---|---|---|
| AWS Amplify SDK | Client-side event capture and batching | Runs on device; batches events automatically | Free (client-side) |
| Amazon Pinpoint | Event ingestion and fan-out | Fully managed; scales automatically | $0.000001 per event collected |
| Kinesis Data Firehose | Reliable buffered delivery to S3 | Auto-scales; no shard management | $0.029 per GB ingested |
| Amazon S3 | Durable, partitioned event storage | Infinite scale; pay per GB stored | $0.023/GB (Standard), $0.0125/GB (IA) |
| AWS Glue Crawler | Automatic schema discovery from S3 data | Runs on schedule; DPU-hour billing | ~$0.44 per run (minimal DPU) |
| Glue Data Catalog | Centralized metadata and table definitions | Managed; free tier covers most use cases | Free for first 1M objects |
| Amazon Athena | Ad-hoc SQL queries over S3 data | Serverless; per-query billing | $5.00 per TB scanned |

There are zero always-on compute resources here. Everything is either event-driven (Pinpoint, Firehose) or invoked on demand (Glue crawler, Athena queries). Process zero events, pay close to nothing. Process a billion events a month, still pay a fraction of what Amplitude would charge. I have run this pipeline on a side project that generated maybe 500 events per day and the bill rounded to zero for months.

Data Flow: From Tap to Query

Follow a single telemetry event from the user tapping a button all the way to an analyst running a query. This walkthrough shows why each component exists and where the handoffs happen:

```mermaid
sequenceDiagram
  participant App as iOS App
  participant SDK as Amplify SDK
  participant PP as Pinpoint
  participant FH as Firehose
  participant S3 as S3 Bucket
  participant GC as Glue Crawler
  participant Cat as Glue Catalog
  participant Ath as Athena

  App->>SDK: recordEvent(type, attributes)
  SDK->>SDK: Batch events locally
  SDK->>PP: submitEvents(batch)
  PP->>FH: PutRecord (event stream)
  FH->>FH: Buffer (size/time threshold)
  FH->>S3: PUT object (GZIP compressed)
  Note over S3: data/year=2026/month=02/day=22/
  GC->>S3: List new partitions
  GC->>Cat: Update table schema
  Ath->>Cat: Get table metadata
  Ath->>S3: Scan partitioned data
  Ath-->>Ath: Return query results
```

Telemetry event lifecycle from device to query

Step 1: Client-side event capture. The iOS app uses the AWS Amplify Analytics SDK to record events. The SDK batches locally, retries on failure, and queues events when the device is offline. Batching keeps network overhead low and avoids hammering the user's battery.

Step 2: Pinpoint ingestion. Pinpoint receives the event batches and tacks on metadata: application ID, client context, session information. Then it forwards each event to the Kinesis Data Firehose delivery stream you configure in the event stream settings.

Step 3: Firehose buffering and delivery. Firehose accumulates incoming records until either the buffer size (default 5 MB) or the buffer interval (default 300 seconds) trips, whichever happens first. It GZIP-compresses the batch and writes one object to S3 using a Hive-style partitioned prefix.
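That flush rule is simple enough to model directly. A toy sketch of the size-or-time logic (not Firehose's implementation), using the default 5 MB / 300 s thresholds from above:

```python
# Toy model of Firehose's flush rule: deliver when the buffer reaches
# the size threshold OR the age threshold, whichever comes first.
BUFFER_SIZE_BYTES = 5 * 1024 * 1024   # 5 MB default
BUFFER_INTERVAL_S = 300               # 300 s default

def should_flush(buffered_bytes: int, buffer_age_s: float) -> bool:
    return buffered_bytes >= BUFFER_SIZE_BYTES or buffer_age_s >= BUFFER_INTERVAL_S

# High volume: the size threshold trips long before the interval does.
assert should_flush(buffered_bytes=6_000_000, buffer_age_s=12)
# Low volume: the interval guarantees delivery within a predictable window.
assert should_flush(buffered_bytes=40_000, buffer_age_s=300)
# Neither threshold reached: keep buffering.
assert not should_flush(buffered_bytes=40_000, buffer_age_s=120)
```

This adaptive behavior is why the same configuration works from prototype volume to production volume.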

Step 4: S3 partitioned storage. Objects land in S3 with keys like data/year=2026/month=02/day=22/firehose-telemetry-1-2026-02-22-14-30-00-abc123.gz. This partitioning scheme is what makes Athena queries both fast and cheap. Skip it and you will regret it within a month.
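The mapping from event timestamp to partition prefix is plain date formatting. A small sketch (the prefix pattern is the one from the Firehose configuration; the helper name is mine):

```python
from datetime import datetime, timezone

def partition_prefix(ts: datetime) -> str:
    """Map an event timestamp to the Hive-style S3 prefix Firehose writes under."""
    return ts.strftime("data/year=%Y/month=%m/day=%d/")

ts = datetime(2026, 2, 22, 14, 30, tzinfo=timezone.utc)
print(partition_prefix(ts))  # data/year=2026/month=02/day=22/
```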

Step 5: Glue schema discovery. A Glue crawler runs on schedule (every 6 hours in my configuration), picks up new partitions, infers the JSON schema from the data, and updates the Glue Data Catalog.

Step 6: Athena queries. Analysts run standard SQL through Athena. Automated reports do the same. Athena pulls table metadata from the Glue Catalog and reads directly from S3, scanning only the partitions your WHERE clause specifies.

S3 Partitioning Strategy

Partitioning is the single most impactful design decision in the entire pipeline. Full stop. Athena charges $5 per TB scanned. A query that scans 100 GB costs $0.50. Partition properly and that same query touches 1 GB: $0.005. I learned this the expensive way on an early project where I skipped partitioning to "move fast" and ended up with a $400 Athena bill in the first month.

The Firehose delivery stream uses Hive-style partitioning with this prefix pattern:

```
data/year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/
```

That produces S3 keys Athena recognizes as partition columns:

```
s3://myapp-dev-telemetry/
├── data/
│   ├── year=2026/
│   │   ├── month=01/
│   │   │   ├── day=15/
│   │   │   │   ├── telemetry-1-2026-01-15-00-05-00-abc123.gz
│   │   │   │   ├── telemetry-1-2026-01-15-00-10-00-def456.gz
│   │   │   │   └── ...
│   │   │   ├── day=16/
│   │   │   └── ...
│   │   ├── month=02/
│   │   └── ...
│   └── ...
└── errors/
    └── ...
```

I tier the data with S3 lifecycle policies based on how often it gets queried:

| Storage Tier | Days After Creation | Cost per GB-Month | Use Case |
|---|---|---|---|
| S3 Standard | 0-30 | $0.023 | Active analysis, recent events |
| S3 Standard-IA | 30-90 | $0.0125 | Historical lookback, trend analysis |
| Expired | 90+ | $0.000 | Data deleted per retention policy |

All of these thresholds are configurable. In production deployments with compliance requirements, I extend expiration to 365 days or longer and add a Glacier Deep Archive tier at 180 days for audit retention.
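As an illustration, that default tiering can be expressed as a single S3 lifecycle rule. This is a sketch with a made-up rule ID; in practice you would apply the dict with boto3's `put_bucket_lifecycle_configuration`:

```python
import json

# Lifecycle rule implementing the tiering table above:
# Standard for 30 days, Standard-IA until day 90, then delete.
lifecycle = {
    "Rules": [
        {
            "ID": "telemetry-tiering",      # arbitrary rule name
            "Status": "Enabled",
            "Filter": {"Prefix": "data/"},  # only tier the event data
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
            ],
            "Expiration": {"Days": 90},
        }
    ]
}

# Applied with boto3 (requires AWS credentials):
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="myapp-dev-telemetry", LifecycleConfiguration=lifecycle)
print(json.dumps(lifecycle, indent=2))
```

For a compliance deployment, you would add a second transition to `DEEP_ARCHIVE` at day 180 and push `Expiration` out to 365 or beyond.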

Kinesis Data Firehose Configuration

I use Firehose here instead of Kinesis Data Streams. Firehose requires no shard capacity planning, no consumer code, and no scaling logic. Data Streams gives you lower latency and consumer fan-out, but for telemetry where a few minutes of lag is fine, Firehose removes an entire category of operational work. I have operated Data Streams pipelines in other contexts and the shard management alone justified switching to Firehose for analytics use cases.

| Parameter | Value | Tradeoff |
|---|---|---|
| Buffer size | 5 MB (configurable 1-128 MB) | Larger buffers = fewer S3 objects = lower S3 request costs, but higher delivery latency |
| Buffer interval | 300 seconds (configurable 60-900 s) | Shorter intervals = lower latency but more small files; longer intervals = better compression ratios |
| Compression | GZIP | 70-85% compression ratio for JSON telemetry; Athena reads GZIP natively |
| Error handling | Separate error prefix in same bucket | Failed records land in errors/ with diagnostic metadata for debugging |
**Note:** Firehose flushes when either the buffer size or interval threshold is reached, whichever comes first. During high-volume periods, the size threshold triggers more frequently, producing well-compressed files. During low-volume periods, the interval threshold ensures data is delivered within a predictable window. This adaptive behavior is why Firehose works well from prototype scale to production scale without reconfiguration.

You need the extended_s3 destination type specifically. The basic s3 destination only supports a static prefix, which dumps all data into one directory and forces Athena to scan everything regardless of the time range in your query. I have seen teams make this mistake and wonder why their Athena bills are ten times what they expected.

AWS Glue: Schema Discovery and Cataloging

Glue does two things here: infer the schema and register partitions. The crawler reads sample files from S3, figures out the JSON structure, registers new date partitions as they appear, and keeps everything updated in the Glue Data Catalog. Athena uses that catalog as its metastore.

| Crawler Setting | Value | Why |
|---|---|---|
| Schedule | Every 6 hours (`cron(0 */6 * * ? *)`) | Balances partition freshness against crawler cost |
| Recrawl behavior | CRAWL_NEW_FOLDERS_ONLY | Only scans new partitions; avoids re-reading historical data |
| Schema change policy | UPDATE_IN_DATABASE / LOG | Evolves schema forward (new columns added); never deletes columns |
| Grouping | CombineCompatibleSchemas | Merges slightly different schemas across partitions into a single table |

Crawler schedule is a cost/freshness tradeoff:

| Schedule | Monthly Cost (~) | Partition Delay |
|---|---|---|
| Every hour | ~$13/mo | Up to 1 hour |
| Every 6 hours | ~$2.20/mo | Up to 6 hours |
| Every 12 hours | ~$1.10/mo | Up to 12 hours |
| Daily | ~$0.55/mo | Up to 24 hours |

Six hours between data landing in S3 and being queryable in Athena is fine for most analytics work. If you absolutely need near-real-time access, add a Lambda triggered by S3 events that calls batch_create_partition to register partitions immediately. I have done this on one project where the product team wanted a live dashboard. It works, but it adds moving parts that are rarely worth it for batch analytics.

Schema evolution just works. Ship a new iOS build with additional event attributes and the crawler picks up the new columns on its next run. Older data returns NULL for those columns. No migrations, no backfills. I have gone through a dozen schema changes on a single pipeline and never had to touch the Glue configuration.

Athena: Querying Telemetry Data

Athena gives you serverless SQL directly against the S3 data. The workgroup configuration is where you enforce cost controls and keep everyone honest:

| Workgroup Setting | Value | Purpose |
|---|---|---|
| Enforce configuration | true | Prevents users from overriding output location or encryption |
| Bytes scanned cutoff | 1 GB | Kills queries that would scan more than 1 GB (cost protection) |
| Result encryption | SSE-KMS | Encrypts query results at rest |
| CloudWatch metrics | Enabled | Tracks query counts, data scanned, and execution times |

That byte scan cutoff matters more than you think. One careless query without partition filters scans the entire dataset. At $5/TB, a full scan of 100 GB costs $0.50. Sounds small. Now multiply that by a team of analysts running queries all day. The 1 GB cutoff caps any single query at roughly $0.005 and trains analysts to include partition filters fast. People learn when their queries fail.
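Those settings map directly onto the Athena `CreateWorkGroup` API. A sketch of the request, with placeholder workgroup, bucket, and KMS key names:

```python
# Athena workgroup request enforcing the settings in the table above.
GIB = 1024 ** 3

workgroup_config = {
    "Name": "telemetry-analysts",                  # placeholder name
    "Configuration": {
        "EnforceWorkGroupConfiguration": True,     # users cannot override
        "BytesScannedCutoffPerQuery": 1 * GIB,     # kill scans over 1 GB
        "PublishCloudWatchMetricsEnabled": True,
        "ResultConfiguration": {
            "OutputLocation": "s3://myapp-dev-athena-results/",  # placeholder
            "EncryptionConfiguration": {
                "EncryptionOption": "SSE_KMS",
                "KmsKey": "alias/athena-results",  # placeholder key alias
            },
        },
    },
}

# Applied with boto3 (requires credentials):
# boto3.client("athena").create_work_group(**workgroup_config)
print(workgroup_config["Configuration"]["BytesScannedCutoffPerQuery"])  # 1073741824
```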

Some example queries showing how to use partitioning correctly:

```sql
-- Daily event counts for the past week (scans ~7 partitions)
SELECT year, month, day, COUNT(*) AS event_count
FROM telemetry_db.data
WHERE year = '2026' AND month = '02' AND day >= '15'
GROUP BY year, month, day
ORDER BY day DESC;

-- Top event types for a specific day (scans 1 partition)
SELECT event_type, COUNT(*) AS occurrences
FROM telemetry_db.data
WHERE year = '2026' AND month = '02' AND day = '22'
GROUP BY event_type
ORDER BY occurrences DESC
LIMIT 20;

-- Session duration analysis (scans 1 month of data)
SELECT
  DATE(from_iso8601_timestamp(event_timestamp)) AS event_date,
  AVG(session_duration) AS avg_session_seconds,
  APPROX_PERCENTILE(session_duration, 0.95) AS p95_session_seconds
FROM telemetry_db.data
WHERE year = '2026' AND month = '02'
  AND event_type = '_session.stop'
GROUP BY DATE(from_iso8601_timestamp(event_timestamp))
ORDER BY event_date;
```

Notice every query includes year, month, and day in the WHERE clause. Make this a habit. It keeps your Athena bill predictable and your queries fast.
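Automated reports submit the same partition-filtered SQL through the Athena API. A sketch of the submit-and-poll pattern, with a placeholder workgroup name:

```python
import time

# Partition filters keep the scan to one day of data, exactly as in the
# interactive examples above.
QUERY = """
SELECT event_type, COUNT(*) AS occurrences
FROM telemetry_db.data
WHERE year = '2026' AND month = '02' AND day = '22'
GROUP BY event_type
ORDER BY occurrences DESC
LIMIT 20
"""

def run_query(athena, sql: str, workgroup: str = "telemetry-analysts") -> str:
    """Submit a query and block until it finishes; returns the execution id."""
    qid = athena.start_query_execution(
        QueryString=sql, WorkGroup=workgroup
    )["QueryExecutionId"]
    while True:
        state = athena.get_query_execution(QueryExecutionId=qid)[
            "QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(1)
    if state != "SUCCEEDED":
        raise RuntimeError(f"query {qid} ended in state {state}")
    return qid

# Usage (requires credentials):
# import boto3
# athena = boto3.client("athena")
# qid = run_query(athena, QUERY)
# rows = athena.get_query_results(QueryExecutionId=qid)
```

Because the workgroup enforces its configuration, the call needs no `ResultConfiguration`; output location and encryption come from the workgroup.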

IAM Architecture

Three IAM roles, each scoped to the bare minimum for its integration point. No wildcards. Each trust policy locks down to the specific AWS service that assumes the role:

```mermaid
flowchart TD
  subgraph Trust["Trust Policies"]
    PP_SVC[pinpoint.amazonaws.com] -->|AssumeRole| R1[Pinpoint → Firehose Role]
    FH_SVC[firehose.amazonaws.com] -->|AssumeRole| R2[Firehose → S3 Role]
    GL_SVC[glue.amazonaws.com] -->|AssumeRole| R3[Glue Crawler Role]
  end

  subgraph Perms["Permissions"]
    R1 -->|PutRecord / PutRecordBatch| FH[Firehose Delivery Stream]
    R2 -->|PutObject / GetObject / ListBucket| S3[S3 Telemetry Bucket]
    R3 -->|GetObject / ListBucket| S3
    R3 -->|AWSGlueServiceRole| GLUE[Glue Service Permissions]
  end
```

IAM role architecture with trust and permission relationships

| Role | Trust Principal | Permissions | Scope |
|---|---|---|---|
| Pinpoint → Firehose | pinpoint.amazonaws.com | firehose:PutRecord, firehose:PutRecordBatch | Specific Firehose stream ARN only |
| Firehose → S3 | firehose.amazonaws.com | s3:PutObject, s3:GetObject, s3:ListBucket, multipart upload actions | Specific S3 bucket and objects only |
| Glue Crawler | glue.amazonaws.com | s3:GetObject, s3:ListBucket + AWSGlueServiceRole managed policy | Specific S3 bucket and objects only |

All three roles include aws:SourceAccount conditions in their trust policies (where supported) to block confused deputy attacks. The Firehose role also needs s3:AbortMultipartUpload and s3:ListBucketMultipartUploads because Firehose uses multipart uploads for large buffered deliveries. I have forgotten those permissions before and spent an hour staring at cryptic access denied errors in the Firehose error log.
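As a concrete sketch, the Firehose role's trust policy with that guard might look like this (the account ID is a placeholder):

```python
import json

# Trust policy for the Firehose -> S3 role: only the Firehose service acting
# on behalf of this specific account may assume it (confused-deputy guard).
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "firehose.amazonaws.com"},
        "Action": "sts:AssumeRole",
        "Condition": {
            "StringEquals": {"aws:SourceAccount": "123456789012"}  # placeholder
        },
    }],
}

print(json.dumps(trust_policy, indent=2))
```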

Cost Estimation

Cost predictability is one of the reasons I keep coming back to this architecture. Every component bills on usage with no minimums:

| Component | 10M Events/Month | 100M Events/Month | 1B Events/Month |
|---|---|---|---|
| Pinpoint | $10.00 | $100.00 | $1,000.00 |
| Firehose (ingestion) | $0.87 | $8.70 | $87.00 |
| S3 (storage, 90-day retention) | $1.50 | $15.00 | $150.00 |
| S3 (requests) | $0.50 | $5.00 | $50.00 |
| Glue Crawler (4x daily) | $2.20 | $2.20 | $2.20 |
| Athena (moderate queries) | $5.00 | $25.00 | $75.00 |
| CloudWatch Logs | $1.00 | $5.00 | $30.00 |
| **Total** | ~$21 | ~$161 | ~$1,394 |

Assumptions: 3 KB average event size (JSON), 80% GZIP compression ratio, Athena scanning 50 GB/month at the 10M tier and scaling proportionally, S3 Standard for the first 30 days then Standard-IA. Pinpoint dominates costs at every tier. If you already have an ingestion layer (API Gateway + Lambda, say), drop Pinpoint and cut costs 40-70%.
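The Firehose row falls straight out of those assumptions; a quick arithmetic check at the 10M tier, using decimal GB:

```python
# Sanity-check the Firehose row: 10M events x 3 KB = 30 GB ingested.
events_per_month = 10_000_000
avg_event_bytes = 3_000            # 3 KB average JSON event
firehose_per_gb = 0.029            # $/GB ingested

ingested_gb = events_per_month * avg_event_bytes / 1e9
cost = ingested_gb * firehose_per_gb
print(f"{ingested_gb:.0f} GB -> ${cost:.2f}/mo")  # 30 GB -> $0.87/mo
```

The same 30 GB, compressed ~80% by GZIP, is what lands in S3, which is why the storage line stays so small even with 90 days of retention.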

Where to cut costs further:

| Lever | Impact | Tradeoff |
|---|---|---|
| Replace Pinpoint with API Gateway + Lambda | 40-70% cost reduction | More infrastructure to manage; lose Pinpoint's campaign features |
| Increase Firehose buffer size | Fewer S3 PUT requests | Higher delivery latency |
| Shorten S3 retention | Linear storage cost reduction | Less historical data available |
| Add Glacier tier instead of expiration | 90%+ storage savings for archive data | Minutes-to-hours retrieval time |
| Use columnar format (Parquet via Firehose transform) | 50-80% Athena cost reduction | Requires Firehose data transformation (Lambda) |

Terraform vs. Pulumi: Side-by-Side

I maintain both a Terraform and a Pulumi implementation of this pipeline. They produce identical infrastructure; pick whichever fits your team's workflow. For a deeper comparison of IaC tools, see Infrastructure as Code: CloudFormation, CDK, Terraform, and Pulumi Compared.

| Aspect | Terraform | Pulumi (Python) |
|---|---|---|
| Repository | tf-config-telemetry-pipeline | pul-py-config-telemetry-pipeline |
| Language | HCL (HashiCorp Configuration Language) | Python |
| File structure | One .tf file per resource group | One .py module per resource group |
| State management | Terraform state file (local or remote) | Pulumi service or self-managed backend |
| Variable handling | variables.tf with type constraints | Pulumi.yaml config with Python helpers |
| Dynamic values | String interpolation, jsonencode() | Output.apply() with lambda functions |
| IAM policies | Inline jsonencode() blocks | json.dumps() inside apply() callbacks |
| Total lines of code | ~350 | ~400 |

The Firehose delivery stream is the most interesting resource to compare because it has complex nested configuration and cross-resource references:

Terraform:

```hcl
resource "aws_kinesis_firehose_delivery_stream" "telemetry" {
  name        = "${local.name_prefix}-telemetry"
  destination = "extended_s3"

  extended_s3_configuration {
    role_arn   = aws_iam_role.firehose_s3.arn
    bucket_arn = aws_s3_bucket.telemetry.arn
    prefix     = "data/year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/"

    buffering_size     = var.firehose_buffer_size_mb
    buffering_interval = var.firehose_buffer_interval_seconds
    compression_format = "GZIP"
  }
}
```

Pulumi (Python):

```python
firehose = aws.kinesis.FirehoseDeliveryStream(
    "telemetry-firehose",
    name=f"{name_prefix}-telemetry",
    destination="extended_s3",
    extended_s3_configuration={
        "role_arn": firehose_s3_role.arn,
        "bucket_arn": telemetry_bucket.arn,
        "prefix": "data/year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/",
        "buffering_size": config["firehose_buffer_size_mb"],
        "buffering_interval": config["firehose_buffer_interval_seconds"],
        "compression_format": "GZIP",
    },
)
```

The structural similarity is intentional; both tools use declarative resource definitions backed by the same AWS provider. Pulumi gives you Python's full type system, IDE support, and real programming constructs for dynamic logic. Terraform's HCL is purpose-built for infrastructure and keeps things simpler for engineers who do not want to think in Python while writing infra. I tend to reach for Pulumi on projects where the infrastructure logic has conditionals and loops, and Terraform when the setup is straightforward.

Common Failure Modes

Every one of these has bitten me in production at least once. Save yourself the debugging time:

| Failure Mode | Symptom | Root Cause | Fix |
|---|---|---|---|
| Firehose delivery failures | Records appear in errors/ prefix | IAM role permissions insufficient; bucket policy conflict | Verify Firehose role has s3:PutObject on the exact bucket ARN including /* suffix |
| Glue crawler finds no tables | Empty database after crawler completes | Crawler S3 target path does not match Firehose prefix; missing trailing slash | Ensure crawler path is s3://bucket/data/ (with trailing slash) matching Firehose prefix |
| Athena query scans entire dataset | High query cost; slow execution | Query missing partition filters (year, month, day in WHERE clause) | Always include partition columns in WHERE; use workgroup byte cutoff as safety net |
| Schema mismatch after app update | NULL values in new columns; query errors | Crawler has not run since new event attributes were deployed | Run crawler manually after deploying app changes; or reduce crawler interval |
| Small file problem | Thousands of tiny S3 objects per day | Buffer interval too short for the event volume | Increase buffer interval to 900s and buffer size to 64-128 MB for high-volume streams |
| Pinpoint event stream lag | Events delayed minutes-to-hours | Pinpoint event stream disabled or Firehose throttled | Check Pinpoint event stream configuration; verify Firehose is not in SUSPENDED state |
| Athena workgroup query failures | Queries fail immediately | Byte scan cutoff too low for the query pattern | Increase bytes_scanned_cutoff_per_query; or optimize query to scan fewer partitions |
| Crawler DPU timeout | Crawler fails to complete | Too many small files; schema too complex | Enable CRAWL_NEW_FOLDERS_ONLY; use CombineCompatibleSchemas grouping |

Key Architectural Recommendations

  1. Always partition by date. This is the one decision that makes or breaks the economics of the whole pipeline. Without Hive-style date partitions, every Athena query scans every byte you have ever stored.
  2. Set Firehose buffer interval to at least 300 seconds. Shorter intervals create swarms of tiny S3 objects. Each one adds GET request overhead during Athena scans. Five minutes balances latency against the small-file problem well enough for analytics.
  3. Use GZIP compression. 70-85% storage reduction. Athena reads GZIP natively. Firehose handles compression; your pipeline sees zero performance impact. There is no reason to skip this.
  4. Configure CRAWL_NEW_FOLDERS_ONLY. I cannot stress this enough. Without it, the Glue crawler re-reads every historical file on every run. On a pipeline with six months of data, that turns a $2/month crawler into a $20/month one for no benefit.
  5. Enforce the Athena workgroup configuration. Set enforce_workgroup_configuration = true. Otherwise someone will accidentally write query results to an unencrypted bucket or blow past the byte scan cutoff.
  6. Scope IAM roles to specific resource ARNs. Never Resource: "*" on Firehose or Glue roles. If an ARN exists, use it. Limits the blast radius when (not if) something goes sideways.
  7. Set up lifecycle policies from day one. Telemetry accumulates faster than you expect. A 90-day retention policy with a 30-day Standard-to-IA transition keeps storage costs from surprising you in month three.
  8. Monitor the Firehose errors/ prefix. Delivery failures are completely silent unless you watch for them. Set up an S3 event notification or a CloudWatch alarm. I once lost two days of telemetry before noticing because I skipped this step on an early deployment.
  9. Start with Firehose, not Data Streams. Shard management is complexity you do not need for analytics telemetry. If sub-second latency or consumer fan-out becomes a requirement later, you can add a Data Stream upstream without redesigning anything.
  10. Codify everything. Terraform or Pulumi, I do not care which. Just do not make Firehose buffer or Glue crawler changes in the console. Manual tweaks drift, and when they cause an incident at 2 AM, nobody remembers what was changed.

Let's Build Something!

I help teams ship cloud infrastructure that actually works at scale. Whether you're modernizing a legacy platform, designing a multi-region architecture from scratch, or figuring out how AI fits into your engineering workflow, I've seen your problem before. Let me help.

Currently taking on select consulting engagements through Vantalect.