About the author: I'm Charles Sieg, a cloud architect and platform engineer who builds apps, services, and infrastructure for Fortune 1000 clients through Vantalect. If your organization is rethinking its software strategy in the age of AI-assisted engineering, let's talk.
Any iOS app with real users generates telemetry. Session starts, feature usage, error events, performance metrics, purchase funnels. Most teams start by shipping all of it to Amplitude or Mixpanel and calling it done. That works for a while. Then the monthly invoice triples, you discover the vendor's data model cannot answer a question your PM asked three days ago, and you realize you are paying somebody else to store your data in a format optimized for their business.
I have deployed the pipeline documented here across several production iOS applications. Pinpoint handles ingestion, Kinesis Data Firehose delivers records reliably to S3, Glue discovers the schema automatically, and Athena gives you full SQL over the raw data. No servers. Scales from zero to billions of events per month. I provide the full infrastructure code in both Terraform and Pulumi so you can pick whichever fits your team.
## Why a Dedicated Telemetry Pipeline
The first pushback I always hear: "Amplitude already does this." Sure. It does, at a price, with their query model, and with your data locked in their infrastructure. Once you evaluate cost, flexibility, and data ownership together, the case for owning the pipeline gets hard to argue against.
| Dimension | Third-Party Platform | AWS Telemetry Pipeline |
|---|---|---|
| Data ownership | Vendor stores your data; export often limited or delayed | You own the S3 bucket and query anytime with any tool |
| Query flexibility | Constrained to vendor's UI and query model | Full SQL via Athena; join with any other dataset |
| Data residency | Limited region selection; dependent on vendor's infrastructure | Deploy in any AWS region; meets sovereignty requirements |
| Retention | Often tiered pricing for longer retention | S3 lifecycle policies; pennies per GB-month |
| Integration | API/webhook exports; often batched or rate-limited | Direct S3 access; feed into ML pipelines, data lakes, BI tools |
| Schema control | Vendor's event schema with limited customization | Your schema, your structure, your partitioning |
The numbers at scale speak for themselves:
| Monthly Event Volume | Amplitude (Growth) | Mixpanel (Growth) | AWS Pipeline (estimated) |
|---|---|---|---|
| 10M events | ~$1,000/mo | ~$1,100/mo | ~$25/mo |
| 100M events | ~$4,500/mo | ~$5,000/mo | ~$180/mo |
| 1B events | ~$20,000+/mo (Enterprise) | ~$25,000+/mo (Enterprise) | ~$1,400/mo |
Those AWS estimates include Pinpoint, Firehose, S3, Glue crawler runs, and moderate Athena query volume. At a billion events per month, this pipeline costs roughly 15x less. The gap only widens with longer retention windows because S3 storage is pennies compared to what analytics vendors charge for historical data access.
## Pipeline Architecture
Standard streaming ingestion pattern. Each component does one thing:
```mermaid
flowchart LR
    A[iOS App<br/>AWS Amplify SDK] --> B[Amazon Pinpoint]
    B --> C[Kinesis Data Firehose]
    C --> D[Amazon S3<br/>Partitioned Storage]
    D --> E[AWS Glue Schema Crawler]
    E --> F[Glue Data Catalog]
    F --> G[Amazon Athena]
    D --> G
```
I chose each component for a specific reason:
| Component | Role | Scaling Model | Cost Driver |
|---|---|---|---|
| AWS Amplify SDK | Client-side event capture and batching | Runs on device; batches events automatically | Free (client-side) |
| Amazon Pinpoint | Event ingestion and fan-out | Fully managed; scales automatically | $0.000001 per event collected |
| Kinesis Data Firehose | Reliable buffered delivery to S3 | Auto-scales; no shard management | $0.029 per GB ingested |
| Amazon S3 | Durable, partitioned event storage | Infinite scale; pay per GB stored | $0.023/GB (Standard), $0.0125/GB (IA) |
| AWS Glue Crawler | Automatic schema discovery from S3 data | Runs on schedule; DPU-hour billing | $0.44 per DPU-hour (10-minute minimum per run) |
| Glue Data Catalog | Centralized metadata and table definitions | Managed; free tier covers most use cases | Free for first 1M objects |
| Amazon Athena | Ad-hoc SQL queries over S3 data | Serverless; per-query billing | $5.00 per TB scanned |
There are zero always-on compute resources here. Everything is either event-driven (Pinpoint, Firehose) or invoked on demand (Glue crawler, Athena queries). Process zero events, pay close to nothing. Process a billion events a month, still pay a fraction of what Amplitude would charge. I have run this pipeline on a side project that generated maybe 500 events per day and the bill rounded to zero for months.
## Data Flow: From Tap to Query
Follow a single telemetry event from the user tapping a button all the way to an analyst running a query. This walkthrough shows why each component exists and where the handoffs happen:
```mermaid
sequenceDiagram
    participant App as iOS App
    participant SDK as Amplify SDK
    participant PP as Pinpoint
    participant FH as Firehose
    participant S3 as S3 Bucket
    participant GC as Glue Crawler
    participant Cat as Glue Catalog
    participant Ath as Athena
    App->>SDK: recordEvent(type, attributes)
    SDK->>SDK: Batch events locally
    SDK->>PP: submitEvents(batch)
    PP->>FH: PutRecord (event stream)
    FH->>FH: Buffer (size/time threshold)
    FH->>S3: PUT object (GZIP compressed)
    Note over S3: data/year=2026/month=02/day=22/
    GC->>S3: List new partitions
    GC->>Cat: Update table schema
    Ath->>Cat: Get table metadata
    Ath->>S3: Scan partitioned data
    Ath-->>Ath: Return query results
```
Step 1: Client-side event capture. The iOS app uses the AWS Amplify Analytics SDK to record events. The SDK batches locally, retries on failure, and queues events when the device is offline. Batching keeps network overhead low and avoids hammering the user's battery.
Step 2: Pinpoint ingestion. Pinpoint receives the event batches and tacks on metadata: application ID, client context, session information. Then it forwards each event to the Kinesis Data Firehose delivery stream you configure in the event stream settings.
Step 3: Firehose buffering and delivery. Firehose accumulates incoming records until either the buffer size (default 5 MB) or the buffer interval (default 300 seconds) trips, whichever happens first. It GZIP-compresses the batch and writes one object to S3 using a Hive-style partitioned prefix.
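The flush decision in Step 3 can be sketched as a simple either-or check. This is an illustrative model of the behavior described above, not Firehose internals; the class name and defaults (5 MB / 300 s) are assumptions mirroring the text:

```python
from dataclasses import dataclass

# Sketch of Firehose's buffer-flush rule: deliver a batch to S3 when
# EITHER the buffered size reaches the size threshold OR the oldest
# record has waited longer than the buffer interval, whichever first.
@dataclass
class Buffer:
    size_limit_bytes: int = 5 * 1024 * 1024   # default 5 MB
    interval_seconds: int = 300               # default 300 s
    buffered_bytes: int = 0
    first_record_age_s: float = 0.0

    def add(self, record_bytes: int) -> None:
        self.buffered_bytes += record_bytes

    def should_flush(self) -> bool:
        return (
            self.buffered_bytes >= self.size_limit_bytes
            or self.first_record_age_s >= self.interval_seconds
        )

buf = Buffer()
buf.add(2 * 1024 * 1024)           # 2 MB buffered, fresh -> keep buffering
assert not buf.should_flush()
buf.add(4 * 1024 * 1024)           # 6 MB total -> size threshold tripped
assert buf.should_flush()

slow = Buffer()
slow.add(1024)                     # tiny batch...
slow.first_record_age_s = 301      # ...but the interval has elapsed
assert slow.should_flush()
```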
Step 4: S3 partitioned storage. Objects land in S3 with keys like data/year=2026/month=02/day=22/firehose-telemetry-1-2026-02-22-14-30-00-abc123.gz. This partitioning scheme is what makes Athena queries both fast and cheap. Skip it and you will regret it within a month.
Step 5: Glue schema discovery. A Glue crawler runs on schedule (every 6 hours in my configuration), picks up new partitions, infers the JSON schema from the data, and updates the Glue Data Catalog.
Step 6: Athena queries. Analysts run standard SQL through Athena. Automated reports do the same. Athena pulls table metadata from the Glue Catalog and reads directly from S3, scanning only the partitions your WHERE clause specifies.
## S3 Partitioning Strategy
Partitioning is the single most impactful design decision in the entire pipeline. Full stop. Athena charges $5 per TB scanned. A query that scans 100 GB costs $0.50. Partition properly and that same query touches 1 GB: $0.005. I learned this the expensive way on an early project where I skipped partitioning to "move fast" and ended up with a $400 Athena bill in the first month.
The Firehose delivery stream uses Hive-style partitioning with this prefix pattern:
```
data/year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/
```
That produces S3 keys Athena recognizes as partition columns:
```
s3://myapp-dev-telemetry/
├── data/
│   ├── year=2026/
│   │   ├── month=01/
│   │   │   ├── day=15/
│   │   │   │   ├── telemetry-1-2026-01-15-00-05-00-abc123.gz
│   │   │   │   ├── telemetry-1-2026-01-15-00-10-00-def456.gz
│   │   │   │   └── ...
│   │   │   ├── day=16/
│   │   │   └── ...
│   │   ├── month=02/
│   │   └── ...
│   └── ...
└── errors/
    └── ...
```
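The prefix expansion above is deterministic, which makes it easy to reproduce outside Firehose — for example when writing backfill objects into the correct partition. A minimal sketch (the helper name is mine, not part of any AWS SDK):

```python
from datetime import datetime, timezone

# Hypothetical helper that expands the same Hive-style prefix Firehose
# builds from its !{timestamp:...} placeholders, zero-padded to match.
def partition_prefix(ts: datetime) -> str:
    return f"data/year={ts:%Y}/month={ts:%m}/day={ts:%d}/"

key = partition_prefix(datetime(2026, 2, 22, 14, 30, tzinfo=timezone.utc))
assert key == "data/year=2026/month=02/day=22/"
```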
I tier the data with S3 lifecycle policies based on how often it gets queried:
| Storage Tier | Days After Creation | Cost per GB-Month | Use Case |
|---|---|---|---|
| S3 Standard | 0-30 | $0.023 | Active analysis, recent events |
| S3 Standard-IA | 30-90 | $0.0125 | Historical lookback, trend analysis |
| Expired | 90+ | $0.000 | Data deleted per retention policy |
All of these thresholds are configurable. In production deployments with compliance requirements, I extend expiration to 365 days or longer and add a Glacier Deep Archive tier at 180 days for audit retention.
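The tiering table above maps directly onto an S3 lifecycle configuration. A sketch of that configuration in the dict shape boto3's `put_bucket_lifecycle_configuration` accepts — the rule ID is a placeholder, and a real deployment would set these values from your IaC variables:

```python
# Build the lifecycle rules described in the tiering table: transition to
# Standard-IA after 30 days, expire after 90, scoped to the data/ prefix.
def lifecycle_rules(ia_after_days: int = 30, expire_after_days: int = 90) -> dict:
    return {
        "Rules": [
            {
                "ID": "telemetry-tiering",  # illustrative rule name
                "Status": "Enabled",
                "Filter": {"Prefix": "data/"},
                "Transitions": [
                    {"Days": ia_after_days, "StorageClass": "STANDARD_IA"}
                ],
                "Expiration": {"Days": expire_after_days},
            }
        ]
    }

cfg = lifecycle_rules()
assert cfg["Rules"][0]["Expiration"]["Days"] == 90

# To apply (assumes AWS credentials and an existing bucket):
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="myapp-dev-telemetry", LifecycleConfiguration=cfg)
```

For a compliance deployment, calling `lifecycle_rules(ia_after_days=30, expire_after_days=365)` and adding a Glacier Deep Archive transition would match the extended retention described above.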
## Kinesis Data Firehose Configuration
I use Firehose here instead of Kinesis Data Streams. Firehose requires no shard capacity planning, no consumer code, and no scaling logic. Data Streams gives you lower latency and consumer fan-out, but for telemetry where a few minutes of lag is fine, Firehose removes an entire category of operational work. I have operated Data Streams pipelines in other contexts and the shard management alone justified switching to Firehose for analytics use cases.
| Parameter | Value | Tradeoff |
|---|---|---|
| Buffer size | 5 MB (configurable 1-128 MB) | Larger buffers = fewer S3 objects = lower S3 request costs, but higher delivery latency |
| Buffer interval | 300 seconds (configurable 60-900s) | Shorter intervals = lower latency but more small files; longer intervals = better compression ratios |
| Compression | GZIP | 70-85% compression ratio for JSON telemetry; Athena reads GZIP natively |
| Error handling | Separate error prefix in same bucket | Failed records land in errors/ with diagnostic metadata for debugging |
You need the extended_s3 destination type specifically. The basic s3 destination only supports a static prefix, which dumps all data into one directory and forces Athena to scan everything regardless of the time range in your query. I have seen teams make this mistake and wonder why their Athena bills are ten times what they expected.
## AWS Glue: Schema Discovery and Cataloging
Glue does two things here: infer the schema and register partitions. The crawler reads sample files from S3, figures out the JSON structure, registers new date partitions as they appear, and keeps everything updated in the Glue Data Catalog. Athena uses that catalog as its metastore.
| Crawler Setting | Value | Why |
|---|---|---|
| Schedule | Every 6 hours (cron(0 */6 * * ? *)) | Balances partition freshness against crawler cost |
| Recrawl behavior | CRAWL_NEW_FOLDERS_ONLY | Only scans new partitions; avoids re-reading historical data |
| Schema change policy | UPDATE_IN_DATABASE / LOG | Evolves schema forward (new columns added); never deletes columns |
| Grouping | CombineCompatibleSchemas | Merges slightly different schemas across partitions into a single table |
Crawler schedule is a cost/freshness tradeoff:
| Schedule | Monthly Cost (~) | Partition Delay |
|---|---|---|
| Every hour | ~$13/mo | Up to 1 hour |
| Every 6 hours | ~$2.20/mo | Up to 6 hours |
| Every 12 hours | ~$1.10/mo | Up to 12 hours |
| Daily | ~$0.55/mo | Up to 24 hours |
Six hours between data landing in S3 and being queryable in Athena is fine for most analytics work. If you absolutely need near-real-time access, add a Lambda triggered by S3 events that calls batch_create_partition to register partitions immediately. I have done this on one project where the product team wanted a live dashboard. It works, but it adds moving parts that are rarely worth it for batch analytics.
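That near-real-time variant can be sketched as an S3-triggered Lambda: parse the partition values out of the new object's key, then register them with Glue. This is a minimal sketch under assumptions — the database and table names are illustrative, and a production `batch_create_partition` call also needs the table's `StorageDescriptor` copied into each `PartitionInput`:

```python
import re

# Matches the Hive-style prefix Firehose writes, capturing year/month/day.
PARTITION_RE = re.compile(r"data/year=(\d{4})/month=(\d{2})/day=(\d{2})/")

def partition_values(s3_key: str):
    """Extract partition values from an S3 key, or None if it doesn't match."""
    m = PARTITION_RE.search(s3_key)
    return list(m.groups()) if m else None

def handler(event, context):  # Lambda entry point (sketch, not executed here)
    import boto3
    glue = boto3.client("glue")
    for record in event["Records"]:
        values = partition_values(record["s3"]["object"]["key"])
        if values:
            # Real calls must also supply a StorageDescriptor per partition.
            glue.batch_create_partition(
                DatabaseName="telemetry_db",
                TableName="data",
                PartitionInputList=[{"Values": values}],
            )

assert partition_values(
    "data/year=2026/month=02/day=22/telemetry-1-abc123.gz"
) == ["2026", "02", "22"]
```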
Schema evolution just works. Ship a new iOS build with additional event attributes and the crawler picks up the new columns on its next run. Older data returns NULL for those columns. No migrations, no backfills. I have gone through a dozen schema changes on a single pipeline and never had to touch the Glue configuration.
## Athena: Querying Telemetry Data
Athena gives you serverless SQL directly against the S3 data. The workgroup configuration is where you enforce cost controls and keep everyone honest:
| Workgroup Setting | Value | Purpose |
|---|---|---|
| Enforce configuration | true | Prevents users from overriding output location or encryption |
| Bytes scanned cutoff | 1 GB | Kills queries that would scan more than 1 GB (cost protection) |
| Result encryption | SSE-KMS | Encrypts query results at rest |
| CloudWatch metrics | Enabled | Tracks query counts, data scanned, and execution times |
That byte scan cutoff matters more than you think. One careless query without partition filters scans the entire dataset. At $5/TB, a full scan of 100 GB costs $0.50. Sounds small. Now multiply that by a team of analysts running queries all day. The 1 GB cutoff caps any single query at roughly $0.005 and trains analysts to include partition filters fast. People learn when their queries fail.
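The arithmetic behind that cutoff is worth making explicit. A quick sketch using decimal units, matching the estimates in this article:

```python
# Athena bills $5.00 per TB scanned, so query cost is linear in bytes read.
ATHENA_USD_PER_TB = 5.0

def query_cost(bytes_scanned: int) -> float:
    return bytes_scanned / 1_000_000_000_000 * ATHENA_USD_PER_TB

# A careless full scan of 100 GB vs. a partition-filtered 1 GB scan:
assert round(query_cost(100_000_000_000), 2) == 0.50
assert round(query_cost(1_000_000_000), 3) == 0.005
```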
Some example queries showing how to use partitioning correctly:
```sql
-- Daily event counts for the past week (scans ~7 partitions)
SELECT year, month, day, COUNT(*) as event_count
FROM telemetry_db.data
WHERE year = '2026' AND month = '02' AND day >= '15'
GROUP BY year, month, day
ORDER BY day DESC;

-- Top event types for a specific day (scans 1 partition)
SELECT event_type, COUNT(*) as occurrences
FROM telemetry_db.data
WHERE year = '2026' AND month = '02' AND day = '22'
GROUP BY event_type
ORDER BY occurrences DESC
LIMIT 20;

-- Session duration analysis (scans 1 month of data)
SELECT
  DATE(from_iso8601_timestamp(event_timestamp)) as event_date,
  AVG(session_duration) as avg_session_seconds,
  APPROX_PERCENTILE(session_duration, 0.95) as p95_session_seconds
FROM telemetry_db.data
WHERE year = '2026' AND month = '02'
  AND event_type = '_session.stop'
GROUP BY DATE(from_iso8601_timestamp(event_timestamp))
ORDER BY event_date;
```
Notice every query includes year, month, and day in the WHERE clause. Make this a habit. It keeps your Athena bill predictable and your queries fast.
## IAM Architecture
Three IAM roles, each scoped to the bare minimum for its integration point. No wildcards. Each trust policy locks down to the specific AWS service that assumes the role:
```mermaid
flowchart TD
    subgraph "Trust Policies"
        PP_SVC[pinpoint.amazonaws.com] -->|AssumeRole| R1[Pinpoint → Firehose Role]
        FH_SVC[firehose.amazonaws.com] -->|AssumeRole| R2[Firehose → S3 Role]
        GL_SVC[glue.amazonaws.com] -->|AssumeRole| R3[Glue Crawler Role]
    end
    subgraph "Permissions"
        R1 -->|PutRecord<br/>PutRecordBatch| FH[Firehose Delivery Stream]
        R2 -->|PutObject<br/>GetObject<br/>ListBucket| S3[S3 Telemetry Bucket]
        R3 -->|GetObject<br/>ListBucket| S3
        R3 -->|AWSGlueServiceRole| GLUE[Glue Service Permissions]
    end
```

| Role | Trust Principal | Permissions | Scope |
|---|---|---|---|
| Pinpoint → Firehose | pinpoint.amazonaws.com | firehose:PutRecord, firehose:PutRecordBatch | Specific Firehose stream ARN only |
| Firehose → S3 | firehose.amazonaws.com | s3:PutObject, s3:GetObject, s3:ListBucket, multipart upload actions | Specific S3 bucket and objects only |
| Glue Crawler | glue.amazonaws.com | s3:GetObject, s3:ListBucket + AWSGlueServiceRole managed policy | Specific S3 bucket and objects only |
All three roles include aws:SourceAccount conditions in their trust policies (where supported) to block confused deputy attacks. The Firehose role also needs s3:AbortMultipartUpload and s3:ListBucketMultipartUploads because Firehose uses multipart uploads for large buffered deliveries. I have forgotten those permissions before and spent an hour staring at cryptic access denied errors in the Firehose error log.
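All three trust policies share the same shape: trust exactly one service principal and pin `aws:SourceAccount`. A sketch of that document as a small builder (the account ID below is a placeholder):

```python
import json

# Build the trust-policy document shared by all three roles: allow one AWS
# service principal to assume the role, conditioned on aws:SourceAccount
# to block the confused-deputy pattern described above.
def trust_policy(service: str, account_id: str) -> str:
    return json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": service},
            "Action": "sts:AssumeRole",
            "Condition": {
                "StringEquals": {"aws:SourceAccount": account_id}
            },
        }],
    })

doc = json.loads(trust_policy("firehose.amazonaws.com", "123456789012"))
assert doc["Statement"][0]["Principal"]["Service"] == "firehose.amazonaws.com"
```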
## Cost Estimation
Cost predictability is one of the reasons I keep coming back to this architecture. Every component bills on usage with no minimums:
| Component | 10M Events/Month | 100M Events/Month | 1B Events/Month |
|---|---|---|---|
| Pinpoint | $10.00 | $100.00 | $1,000.00 |
| Firehose (ingestion) | $0.87 | $8.70 | $87.00 |
| S3 (storage, 90-day retention) | $1.50 | $15.00 | $150.00 |
| S3 (requests) | $0.50 | $5.00 | $50.00 |
| Glue Crawler (4x daily) | $2.20 | $2.20 | $2.20 |
| Athena (moderate queries) | $5.00 | $25.00 | $75.00 |
| CloudWatch Logs | $1.00 | $5.00 | $30.00 |
| Total | ~$21 | ~$161 | ~$1,394 |
Assumptions: 3 KB average event size (JSON), 80% GZIP compression ratio, Athena scanning 50 GB/month at the 10M tier and scaling proportionally, S3 Standard for the first 30 days then Standard-IA. Pinpoint dominates costs at every tier. If you already have an ingestion layer (API Gateway + Lambda, say), drop Pinpoint and cut costs 40-70%.
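The two dominant line items are easy to model from those assumptions. A rough sketch — it covers only Pinpoint and Firehose ingestion, so it deliberately under-counts the full bill, and the unit prices are the ones quoted in this article:

```python
# Rough cost model for the two largest line items, using the article's
# assumptions: $1.00 per million Pinpoint events, 3 KB average event,
# $0.029 per (decimal) GB of Firehose ingestion.
def pinpoint_cost(events: int) -> float:
    return events / 1_000_000 * 1.00

def firehose_cost(events: int, avg_event_kb: float = 3.0) -> float:
    gb = events * avg_event_kb * 1000 / 1_000_000_000
    return gb * 0.029

# Matches the 10M-event row of the table above:
assert pinpoint_cost(10_000_000) == 10.00
assert round(firehose_cost(10_000_000), 2) == 0.87
```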
Where to cut costs further:
| Lever | Impact | Tradeoff |
|---|---|---|
| Replace Pinpoint with API Gateway + Lambda | 40-70% cost reduction | More infrastructure to manage; lose Pinpoint's campaign features |
| Increase Firehose buffer size | Fewer S3 PUT requests | Higher delivery latency |
| Shorten S3 retention | Linear storage cost reduction | Less historical data available |
| Add Glacier tier instead of expiration | 90%+ storage savings for archive data | Minutes-to-hours retrieval time |
| Use columnar format (Parquet via Firehose transform) | 50-80% Athena cost reduction | Requires Firehose data transformation (Lambda) |
## Terraform vs. Pulumi: Side-by-Side
I maintain both a Terraform and a Pulumi implementation of this pipeline. They produce identical infrastructure; pick whichever fits your team's workflow. For a deeper comparison of IaC tools, see Infrastructure as Code: CloudFormation, CDK, Terraform, and Pulumi Compared.
| Aspect | Terraform | Pulumi (Python) |
|---|---|---|
| Repository | tf-config-telemetry-pipeline | pul-py-config-telemetry-pipeline |
| Language | HCL (HashiCorp Configuration Language) | Python |
| File structure | One .tf file per resource group | One .py module per resource group |
| State management | Terraform state file (local or remote) | Pulumi service or self-managed backend |
| Variable handling | variables.tf with type constraints | Pulumi.yaml config with Python helpers |
| Dynamic values | String interpolation, jsonencode() | Output.apply() with lambda functions |
| IAM policies | Inline jsonencode() blocks | json.dumps() inside apply() callbacks |
| Total lines of code | ~350 | ~400 |
The Firehose delivery stream is the most interesting resource to compare because it has complex nested configuration and cross-resource references:
Terraform:
```hcl
resource "aws_kinesis_firehose_delivery_stream" "telemetry" {
  name        = "${local.name_prefix}-telemetry"
  destination = "extended_s3"

  extended_s3_configuration {
    role_arn           = aws_iam_role.firehose_s3.arn
    bucket_arn         = aws_s3_bucket.telemetry.arn
    prefix             = "data/year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/"
    buffering_size     = var.firehose_buffer_size_mb
    buffering_interval = var.firehose_buffer_interval_seconds
    compression_format = "GZIP"
  }
}
```
Pulumi (Python):
```python
firehose = aws.kinesis.FirehoseDeliveryStream(
    "telemetry-firehose",
    name=f"{name_prefix}-telemetry",
    destination="extended_s3",
    extended_s3_configuration={
        "role_arn": firehose_s3_role.arn,
        "bucket_arn": telemetry_bucket.arn,
        "prefix": "data/year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/",
        "buffering_size": config["firehose_buffer_size_mb"],
        "buffering_interval": config["firehose_buffer_interval_seconds"],
        "compression_format": "GZIP",
    },
)
```
The structural similarity is intentional; both tools use declarative resource definitions backed by the same AWS provider. Pulumi gives you Python's full type system, IDE support, and real programming constructs for dynamic logic. Terraform's HCL is purpose-built for infrastructure and keeps things simpler for engineers who do not want to think in Python while writing infra. I tend to reach for Pulumi on projects where the infrastructure logic has conditionals and loops, and Terraform when the setup is straightforward.
## Common Failure Modes
Every one of these has bitten me in production at least once. Save yourself the debugging time:
| Failure Mode | Symptom | Root Cause | Fix |
|---|---|---|---|
| Firehose delivery failures | Records appear in errors/ prefix | IAM role permissions insufficient; bucket policy conflict | Verify Firehose role has s3:PutObject on the exact bucket ARN including /* suffix |
| Glue crawler finds no tables | Empty database after crawler completes | Crawler S3 target path does not match Firehose prefix; missing trailing slash | Ensure crawler path is s3://bucket/data/ (with trailing slash) matching Firehose prefix |
| Athena query scans entire dataset | High query cost; slow execution | Query missing partition filters (year, month, day in WHERE clause) | Always include partition columns in WHERE; use workgroup byte cutoff as safety net |
| Schema mismatch after app update | NULL values in new columns; query errors | Crawler has not run since new event attributes were deployed | Run crawler manually after deploying app changes; or reduce crawler interval |
| Small file problem | Thousands of tiny S3 objects per day | Buffer interval too short for the event volume | Increase buffer interval to 900s and buffer size to 64-128 MB for high-volume streams |
| Pinpoint event stream lag | Events delayed minutes-to-hours | Pinpoint event stream disabled or Firehose throttled | Check Pinpoint event stream configuration; verify Firehose is not in SUSPENDED state |
| Athena workgroup query failures | Queries fail immediately | Byte scan cutoff too low for the query pattern | Increase bytes_scanned_cutoff_per_query; or optimize query to scan fewer partitions |
| Crawler DPU timeout | Crawler fails to complete | Too many small files; schema too complex | Enable CRAWL_NEW_FOLDERS_ONLY; use CombineCompatibleSchemas grouping |
## Key Architectural Recommendations
- Always partition by date. This is the one decision that makes or breaks the economics of the whole pipeline. Without Hive-style date partitions, every Athena query scans every byte you have ever stored.
- Set Firehose buffer interval to at least 300 seconds. Shorter intervals create swarms of tiny S3 objects. Each one adds GET request overhead during Athena scans. Five minutes balances latency against the small-file problem well enough for analytics.
- Use GZIP compression. 70-85% storage reduction. Athena reads GZIP natively. Firehose handles compression; your pipeline sees zero performance impact. There is no reason to skip this.
- Configure `CRAWL_NEW_FOLDERS_ONLY`. I cannot stress this enough. Without it, the Glue crawler re-reads every historical file on every run. On a pipeline with six months of data, that turns a $2/month crawler into a $20/month one for no benefit.
- Enforce the Athena workgroup configuration. Set `enforce_workgroup_configuration = true`. Otherwise someone will accidentally write query results to an unencrypted bucket or blow past the byte scan cutoff.
- Scope IAM roles to specific resource ARNs. Never `Resource: "*"` on Firehose or Glue roles. If an ARN exists, use it. Limits the blast radius when (not if) something goes sideways.
- Set up lifecycle policies from day one. Telemetry accumulates faster than you expect. A 90-day retention policy with a 30-day Standard-to-IA transition keeps storage costs from surprising you in month three.
- Monitor the Firehose `errors/` prefix. Delivery failures are completely silent unless you watch for them. Set up an S3 event notification or a CloudWatch alarm. I once lost two days of telemetry before noticing because I skipped this step on an early deployment.
- Start with Firehose, not Data Streams. Shard management is complexity you do not need for analytics telemetry. If sub-second latency or consumer fan-out becomes a requirement later, you can add a Data Stream upstream without redesigning anything.
- Codify everything. Terraform or Pulumi, I do not care which. Just do not make Firehose buffer or Glue crawler changes in the console. Manual tweaks drift, and when they cause an incident at 2 AM, nobody remembers what was changed.
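The `errors/` prefix check from the recommendations above can be as simple as listing the prefix and alerting if anything is there. A sketch with the client injected so the logic is testable without AWS — in production you would pass `boto3.client("s3")` and wire the result into an alarm:

```python
# List objects under the Firehose error prefix; a non-empty result means
# delivery failures that would otherwise go unnoticed.
def failed_delivery_keys(s3_client, bucket: str, prefix: str = "errors/") -> list:
    resp = s3_client.list_objects_v2(Bucket=bucket, Prefix=prefix)
    return [obj["Key"] for obj in resp.get("Contents", [])]

class FakeS3:
    """Stand-in for boto3's S3 client, for illustration only."""
    def list_objects_v2(self, Bucket, Prefix):
        return {"Contents": [{"Key": f"{Prefix}2026/02/22/failed-batch.gz"}]}

keys = failed_delivery_keys(FakeS3(), "myapp-dev-telemetry")
assert keys == ["errors/2026/02/22/failed-batch.gz"]
```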
## Additional Resources
- Amazon Pinpoint Event Streams documentation
- Kinesis Data Firehose Dynamic Partitioning
- AWS Glue Crawler Configuration
- Athena Performance Tuning
- Terraform implementation: tf-config-telemetry-pipeline
- Pulumi implementation: pul-py-config-telemetry-pipeline
## Let's Build Something!
I help teams ship cloud infrastructure that actually works at scale. Whether you're modernizing a legacy platform, designing a multi-region architecture from scratch, or figuring out how AI fits into your engineering workflow, I've seen your problem before. Let me help.
Currently taking on select consulting engagements through Vantalect.

