About the author: I'm Charles Sieg, a cloud architect and platform engineer who builds apps, services, and infrastructure for Fortune 1000 clients through Vantalect. If your organization is rethinking its software strategy in the age of AI-assisted engineering, let's talk.
I spend most of my time building production systems on AWS. I also spend a growing fraction of my time working with LLMs to design and implement those systems. That combination raises a question I kept coming back to: how much does the model actually know about AWS? Not "can it write a CloudFormation template" or "can it debug a Lambda timeout." Those are execution tests. I wanted something more fundamental. If I ask a model about VPC peering limits, ElastiCache shard maximums, or the four-step Secrets Manager rotation lifecycle, does it know the answer? Does it know the current answer, or is it stuck on a value from two years ago?
So I built a benchmark. 106 questions across 7 AWS categories, 3 difficulty tiers mapped to AWS certification levels, and 3 question types with distinct scoring mechanisms. I ran it against four models: Claude Sonnet 4.5, Claude Sonnet 4, Claude Haiku 4.5, and Inception's Mercury 2 (a diffusion-based language model). The results surfaced patterns I did not expect, particularly around training data staleness and the gap between factual recall and architecture reasoning.
Why a Cloud Knowledge Benchmark
Standard LLM benchmarks test general reasoning (MMLU, GPQA), coding ability (SWE-Bench, HumanEval), or mathematical reasoning (GSM8K, MATH). None of them test whether a model knows that Aurora storage caps at 128 TiB, that Fargate tasks now support 16 vCPU (up from 4), or that ElastiCache for Redis cluster mode supports 500 shards. This matters because cloud architects increasingly rely on LLMs for design decisions, documentation, and architecture reviews. If the model confidently states an outdated service limit, that misinformation flows directly into production designs.
Existing benchmarks also saturate quickly. MMLU scores above 88% for frontier models no longer differentiate capability. I wanted a domain-specific benchmark where the questions are hard enough to reveal real knowledge gaps, where the answers change over time (forcing models to stay current), and where the scoring captures reasoning quality alongside factual accuracy.
What This Tests vs. What It Does Not
This benchmark tests knowledge: factual recall, conceptual understanding, and architecture reasoning about AWS services. It does not test code generation, tool use, or the ability to execute AWS API calls. A model can score 100% here and still produce broken Terraform. Conversely, a model that generates correct infrastructure code through pattern matching will often score poorly on questions about why a particular architecture works.
The distinction matters. Knowledge benchmarks and execution benchmarks measure orthogonal capabilities. Both are useful. I built the knowledge side first because it is faster to validate (no infrastructure required) and because knowledge gaps are harder to detect in production than execution failures.
Benchmark Design
Question Categories
I organized 106 questions across 7 AWS service categories, covering 28 individual services. The distribution reflects the relative complexity and breadth of each category.
| Category | Services | Questions | Coverage |
|---|---|---|---|
| Compute | EC2, Lambda, ECS, Fargate | 17 | Instance types, container orchestration, serverless limits |
| Networking | VPC, Route 53, CloudFront, ELB | 23 | Routing, DNS, CDN architecture, load balancer internals |
| Storage | S3, EBS, EFS, Glacier | 14 | Object storage, block storage, archival tiers |
| Database | RDS, DynamoDB, ElastiCache, Aurora | 16 | Relational, NoSQL, caching, distributed storage |
| Security | IAM, KMS, WAF, Secrets Manager | 14 | Identity, encryption, web application firewall, credential rotation |
| Management | CloudWatch, CloudFormation, Config, Organizations | 12 | Monitoring, IaC, compliance, multi-account governance |
| Integration | SQS, SNS, EventBridge, Step Functions | 10 | Messaging, pub/sub, event routing, workflow orchestration |
Every question links back to a specific AWS documentation URL for verification. Questions that reference service limits or quotas include the current value as of February 2026.
Three Question Types
Each question type uses a different scoring mechanism, and the choice of type depends on what kind of knowledge it targets.
| Type | Count | Scoring | What It Tests |
|---|---|---|---|
| Factual | 41 | Exact match with normalization | Specific numbers, limits, naming conventions, defaults |
| Multiple Choice | 46 | Letter extraction | Conceptual understanding with plausible distractors |
| Open-ended | 19 | LLM-as-judge (Opus 4.6) | Architecture reasoning, design trade-offs, multi-step explanations |
Factual questions test hard facts. "What is the maximum number of shards in an ElastiCache Redis cluster?" has one correct answer: 500. The scorer normalizes text (strips punctuation, collapses whitespace, lowercases) and checks for substring containment, so "500 shards" matches as well as "500."
Multiple choice questions test conceptual understanding. Each has four options with plausible distractors drawn from real AWS concepts. The scorer extracts the answer letter using regex, with fallback to the first valid letter in the response.
Open-ended questions test architecture reasoning. "Describe Aurora's storage architecture, including the quorum model, protection groups, and why cloning is instant" requires a structured answer covering multiple technical points. No string matching works here.
Difficulty Tiers
Questions map to three difficulty levels aligned with AWS certification tiers.
| Tier | Count | Equivalent | Description |
|---|---|---|---|
| Foundational | 40 | Cloud Practitioner | Service purpose, basic limits, core concepts |
| Associate | 39 | Solutions Architect Associate | Architecture patterns, service comparisons, configuration trade-offs |
| Professional | 27 | Solutions Architect Professional | Multi-service architectures, failure modes, advanced internals |
The professional tier proved to be the strongest differentiator. Every model scored above 89% on foundational questions. Professional-tier accuracy ranged from 66.7% to 100%.
The Scoring Pipeline
```mermaid
flowchart LR
    Q[Question] --> T{Question Type}
    T -->|Factual| EM[Exact Match Scorer]
    T -->|Multiple Choice| MC[MC Letter Extractor]
    T -->|Open-ended| RB[Rubric Scorer]
    EM --> N[Normalize Text]
    N --> CA[Check Acceptable Answers]
    CA --> LA[Check Legacy Answers]
    MC --> RE[Regex Extraction]
    RE --> LM[Letter Match]
    RB --> JP[Build Judge Prompt]
    JP --> OJ["Opus 4.6 Judge"]
    OJ --> V[Parse Verdict]
    CA --> R[Result]
    LA --> R
    LM --> R
    V --> R
```
Exact Match Scoring
The exact match scorer handles factual questions. It normalizes both the model's response and the expected answer using Unicode normalization (NFKC), lowercasing, punctuation removal, and whitespace collapsing. Then it checks two things: exact match against any acceptable answer, and substring containment (so "The maximum is 500 shards" matches an expected answer of "500").
This approach handles the natural variation in how models phrase factual answers. A response of "128 TiB" matches against acceptable answers of "128 TiB", "128TiB", "128 tebibytes", or "131072 GiB." I maintained a list of acceptable answer variants for each factual question.
The substring containment check introduces a known trade-off. "24 hours" contains "4 hours" as a substring, which creates false positives for a question where the correct answer is "4 hours." I chose to accept this edge case rather than require models to produce the exact answer string with zero additional context. In practice, it affected one question (Secrets Manager rotation interval), which I addressed through manual review of the results.
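The normalize-then-match logic is simple enough to sketch directly. This is a simplified stand-in for the harness's scorer (function names are illustrative), but it implements the steps described: NFKC normalization, lowercasing, punctuation removal, whitespace collapsing, then exact match or substring containment against each acceptable answer.

```python
import re
import unicodedata

def normalize(text: str) -> str:
    """NFKC-normalize, lowercase, strip punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKC", text).lower()
    text = re.sub(r"[^\w\s]", " ", text)        # punctuation -> space
    return re.sub(r"\s+", " ", text).strip()    # collapse runs of whitespace

def score_factual(response: str, acceptable: list[str]) -> bool:
    """Correct if any acceptable answer matches exactly or appears as a substring."""
    norm_response = normalize(response)
    for answer in acceptable:
        norm_answer = normalize(answer)
        if norm_response == norm_answer or norm_answer in norm_response:
            return True
    return False
```

Note that this sketch reproduces the trade-off described above: `score_factual("24 hours", ["4 hours"])` returns `True`, because "4 hours" is a substring of "24 hours."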
LLM-as-Judge for Open-Ended Questions
Open-ended questions need a judge. I used Claude Opus 4.6 in a structured evaluation prompt. The judge receives the original question, a reference answer, a list of required knowledge points (the rubric), and the model's response. It returns a binary verdict (correct or incorrect) with a reason for failures.
REQUIRED KNOWLEDGE POINTS:
- Must explain that Aurora storage spans three AZs with six copies
- Must describe 10 GB protection groups
- Must explain the 4/6 write quorum and 3/6 read quorum
- Must explain that compute and storage are separated
- Must describe copy-on-write cloning
Binary scoring (pass/fail) was a deliberate choice. Research on LLM judges shows that binary evaluations produce more consistent results than 5-point or 10-point scales. A model either demonstrates the required knowledge or it does not. I found no value in "partial credit" for architecture reasoning; a response that covers 3 of 5 required points is often missing the most critical one.
The judge prompt also includes explicit instructions to tolerate stale service limits. AWS changes quotas regularly, and a model trained six months ago will cite different numbers. The judge evaluates conceptual understanding, not whether the model memorized the latest quota update.
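The judge prompt is assembled from the pieces described above. The exact wording of the production prompt isn't reproduced in this article; the sketch below follows its structure (question, reference answer, rubric, legacy-limit tolerance, response under evaluation) with illustrative phrasing.

```python
def build_judge_prompt(question: str, reference: str, rubric: list[str],
                       legacy_notes: str, response: str) -> str:
    """Assemble the structured evaluation prompt sent to the judge model.

    Section wording is illustrative; the real prompt's text may differ.
    """
    points = "\n".join(f"- {p}" for p in rubric)
    return (
        f"QUESTION:\n{question}\n\n"
        f"REFERENCE ANSWER:\n{reference}\n\n"
        f"REQUIRED KNOWLEDGE POINTS:\n{points}\n\n"
        f"NOTE: Do not penalize responses that cite older service limits "
        f"or quotas. {legacy_notes}\n\n"
        f"RESPONSE TO EVALUATE:\n{response}\n\n"
        "Explain your reasoning, then end with a single line of the form "
        "'VERDICT: correct' or 'VERDICT: incorrect', followed by a reason "
        "if the verdict is incorrect."
    )
```

Asking for reasoning before the verdict line matters; the stability implications are discussed under "The Judge Problem" below.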
Objective Accuracy vs. Overall Accuracy
I track two accuracy metrics. Overall accuracy counts all three question types. Objective accuracy counts only factual and multiple choice questions, excluding open-ended results.
This distinction exists because open-ended scoring depends on the judge model. If the judge prompt is too strict, open-ended scores drop. If it is too lenient, they inflate. Objective accuracy gives a judge-independent measure of model knowledge. The two metrics tell different stories: objective accuracy clustered tightly (94.3% to 97.7%) while overall accuracy ranged from 87.7% to 97.2%.
The Training Data Staleness Problem
The most operationally significant finding from this benchmark has nothing to do with model rankings. AWS changes service limits, renames features, and updates quotas on a regular cadence. Models trained on data from six to twelve months ago will confidently state outdated values.
Specific Examples
| Question | Current Answer | Legacy Answer | Models Affected |
|---|---|---|---|
| ElastiCache max shards | 500 | 90 | Mercury 2 |
| Fargate max vCPU | 16 vCPU | 4 vCPU | Haiku 4.5, Mercury 2 |
| Aurora max storage | 128 TiB | 64 TiB | (unit alias issue) |
| SNS max subscriptions per topic | 12.5 million | 10 million | Mercury 2 |
| Secrets Manager min rotation | 4 hours | 1 day / 30 days | All four models |
| Glacier Deep Archive retrieval | 12 hours | 48 hours | Mercury 2 |
The Secrets Manager result stands out. Every model tested, including Sonnet 4.5, cited an outdated minimum rotation interval. AWS updated this from 1 day to 4 hours, and none of the models had incorporated the change. This is a genuine risk for teams using LLMs for security architecture reviews. An AI assistant recommending a 1-day rotation schedule for database credentials would be following outdated guidance.
Legacy Answer Tracking
Rather than penalizing models for stale training data, I implemented a "legacy answer" system. Each factual question can carry a list of previously correct answers alongside current acceptable answers. When a model matches a legacy answer, it scores as correct but gets flagged with a [LEGACY] tag. This separates "the model knows the concept but has outdated data" from "the model does not know the answer."
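The check order matters: current answers first, then legacy answers, so a model citing the current value is never flagged. A self-contained sketch (names illustrative, normalization matching the exact-match scorer's behavior):

```python
import re
import unicodedata
from enum import Enum

def _norm(text: str) -> str:
    """Same normalization as the exact-match scorer: NFKC, lowercase, no punctuation."""
    text = unicodedata.normalize("NFKC", text).lower()
    text = re.sub(r"[^\w\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

class Verdict(Enum):
    CORRECT = "correct"
    CORRECT_LEGACY = "correct [LEGACY]"   # right concept, outdated value
    INCORRECT = "incorrect"

def score_with_legacy(response: str, acceptable: list[str], legacy: list[str]) -> Verdict:
    """Check current acceptable answers first, then previously-correct values."""
    def matches(answers: list[str]) -> bool:
        norm_resp = _norm(response)
        return any(_norm(a) == norm_resp or _norm(a) in norm_resp for a in answers)
    if matches(acceptable):
        return Verdict.CORRECT
    if matches(legacy):
        return Verdict.CORRECT_LEGACY
    return Verdict.INCORRECT
```

Using the ElastiCache example: a model answering "90 shards" against a current answer of "500" scores as a legacy match, not an outright failure.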
For the open-ended judge, I added a legacy_notes field with instructions telling Opus 4.6 not to penalize responses that cite older service limits. Cloud infrastructure changes faster than training data updates. Punishing models for this creates a benchmark that measures training recency rather than knowledge depth.
Results
Overall Performance
| Model | Overall | Objective (F+MC) | Factual | Multiple Choice | Open-ended | Legacy Answers |
|---|---|---|---|---|---|---|
| Claude Sonnet 4.5 | 97.2% | 96.6% | 95.1% | 97.8% | 100.0% | 1 |
| Mercury 2 | 92.5% | 96.6% | 92.7% | 100.0% | 73.7% | 2 |
| Claude Sonnet 4 | 88.7% | 97.7% | 97.6% | 97.8% | 47.4% | 1 |
| Claude Haiku 4.5 | 87.7% | 94.3% | 95.1% | 93.5% | 57.9% | 2 |
Three things jump out immediately.
First, objective accuracy clusters tightly. All four models scored between 94.3% and 97.7% on factual plus multiple choice questions. On pure factual recall and conceptual understanding of AWS services, these models are separated by less than 4 percentage points. The raw knowledge is there.
Second, open-ended scoring is the primary differentiator. Sonnet 4.5 scored 100% on open-ended questions (19 for 19). Sonnet 4 scored 47.4%. Same model family, separated by one generation, with a 53-point gap on architecture reasoning. The Opus 4.6 judge demands comprehensive coverage of rubric points, and Sonnet 4.5 consistently produced responses that hit every required knowledge point.
Third, Mercury 2 scored 100% on multiple choice. A diffusion-based model, architecturally distinct from the autoregressive Transformer models it competed against, achieved perfect accuracy on conceptual AWS questions. This suggests that the diffusion approach does not inherently sacrifice knowledge retention.
Performance by Difficulty
| Model | Foundational | Associate | Professional |
|---|---|---|---|
| Claude Sonnet 4.5 | 97.5% | 94.9% | 100.0% |
| Mercury 2 | 92.5% | 100.0% | 81.5% |
| Claude Sonnet 4 | 97.5% | 94.9% | 66.7% |
| Claude Haiku 4.5 | 97.5% | 89.7% | 70.4% |
Professional-tier questions produced the widest spread: 33.3 percentage points between Sonnet 4.5 (100%) and Sonnet 4 (66.7%). Foundational questions barely differentiated the models at all (92.5% to 97.5%). If you are evaluating LLMs for cloud architecture work, skip the easy questions. Professional-tier questions about multi-service architectures, failure modes, and advanced internals reveal far more about the model's depth.
Sonnet 4.5 achieving 100% on all 27 professional questions deserves emphasis. These questions cover Aurora quorum replication (see my AWS Aurora: Getting Close to Multi-Region Active/Active), ElastiCache data tiering architecture (Amazon ElastiCache: An Architecture Deep-Dive), Step Functions distributed map execution (AWS Step Functions: An Architecture Deep-Dive), VPC endpoint gateway vs. interface mechanics, and similarly deep topics. The model handled all of them.
Performance by Category
| Model | Compute | Networking | Storage | Database | Security | Management | Integration |
|---|---|---|---|---|---|---|---|
| Claude Sonnet 4.5 | 94.1% | 91.3% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% |
| Mercury 2 | 94.1% | 95.7% | 85.7% | 87.5% | 92.9% | 100.0% | 90.0% |
| Claude Sonnet 4 | 88.2% | 87.0% | 85.7% | 87.5% | 92.9% | 83.3% | 100.0% |
| Claude Haiku 4.5 | 82.4% | 87.0% | 78.6% | 87.5% | 92.9% | 91.7% | 100.0% |
Sonnet 4.5 scored 100% in five of seven categories. Its misses concentrated in compute (one Lambda question) and networking (two VPC-related questions). Mercury 2 led on networking at 95.7%, suggesting strong training data coverage for VPC, Route 53, and CloudFront topics.
Storage was the weakest category across the board for non-Sonnet-4.5 models. S3 lifecycle rules (covered in AWS S3 Cost Optimization: The Complete Savings Playbook), Glacier retrieval tiers, and EBS snapshot mechanics tripped multiple models. These are areas where AWS has made frequent changes to pricing, performance characteristics, and available options.
Latency
| Model | Architecture | Total Latency | Per-Question Average |
|---|---|---|---|
| Mercury 2 | Diffusion | 110s | 1.04s |
| Claude Haiku 4.5 | Autoregressive | 265s | 2.50s |
| Claude Sonnet 4 | Autoregressive | 450s | 4.25s |
| Claude Sonnet 4.5 | Autoregressive | 597s | 5.63s |
Mercury 2 completed the full 106-question benchmark in 110 seconds. Sonnet 4.5 took nearly 10 minutes. Mercury's diffusion architecture generates multiple tokens in parallel rather than one at a time, producing roughly 1,000 tokens per second compared to 89 tokens per second for Haiku 4.5. For a knowledge benchmark where response quality matters more than speed, the latency difference is irrelevant. For a production system answering thousands of architecture queries, it changes the cost equation entirely.
Analysis and Patterns
Open-Ended Scoring Is the Real Test
Factual recall and multiple choice scores clustered tightly. Open-ended scores did not. This aligns with what I have observed in production: most models can retrieve facts about AWS services, but fewer can synthesize those facts into coherent architecture explanations. The open-ended questions in this benchmark require models to cover 5 to 7 specific knowledge points in a single response. Missing one point triggers a failure from the judge.
Sonnet 4 scored 47.4% on open-ended questions while scoring 97.7% on objective questions. The model knows the facts individually but struggles to assemble them into comprehensive architecture narratives. Sonnet 4.5 closed this gap entirely, scoring 100% on every open-ended question. Whatever Anthropic changed between Sonnet 4 and 4.5, it dramatically improved the model's ability to produce structured, thorough technical explanations.
The Judge Problem
Using Opus 4.6 as a judge introduces its own complexity. The judge's behavior depends on the prompt, and small prompt changes cascade into score changes. During development, I observed a 20-point swing in Sonnet 4's open-ended accuracy between two different judge prompt versions. One prompt asked for "exactly one word: correct or incorrect." The other asked for "VERDICT: correct" with a reason for failures. The second format proved more stable because it gave the judge room to reason before delivering the verdict.
Binary scoring (pass/fail) was more consistent than graded scoring. When I experimented with asking the judge to score each rubric point individually and average them, inter-run variance increased. The judge would score the same response differently on subsequent runs if rubric points overlapped conceptually.
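Parsing the "VERDICT:" format is straightforward. A sketch (my illustrative version, assuming the verdict-line format described above); treating a missing verdict line as a failure is a defensive choice so malformed judge output surfaces during review instead of silently passing:

```python
import re

def parse_verdict(judge_response: str) -> tuple[bool, str]:
    """Extract the binary verdict and a failure reason from the judge's response."""
    match = re.search(r"VERDICT:\s*(correct|incorrect)", judge_response, re.IGNORECASE)
    if match is None:
        return False, "no verdict line found"
    passed = match.group(1).lower() == "correct"
    # For failures, keep the judge's full text as the reason for manual review.
    return passed, "" if passed else judge_response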
Model Knowledge Is Broader Than Expected
I expected Mercury 2, a diffusion-based model primarily marketed for speed and code generation, to struggle on a cloud knowledge benchmark. It scored 92.5% overall and 100% on multiple choice. The model clearly absorbed substantial AWS knowledge during training, even though cloud infrastructure is tangential to its primary use case.
This suggests that large-scale language models, regardless of their generation architecture (autoregressive vs. diffusion), develop broad knowledge bases as a byproduct of training on internet-scale data. The models tested here range from a compact, speed-optimized diffusion model (Mercury 2) to a large frontier model (Sonnet 4.5), and all of them knew at least 87.7% of the AWS content.
Building the Harness
Architecture
```mermaid
flowchart TD
    CLI["CLI (argparse + rich)"] --> Runner[Benchmark Runner]
    Runner --> Loader[Question Loader]
    Runner --> Provider{Provider Router}
    Provider -->|anthropic/*| AP[Anthropic Native SDK]
    Provider -->|mercury/*| LP["LiteLLM (OpenAI-compat)"]
    Runner --> Scorer{Scorer Router}
    Scorer -->|factual| EM[Exact Match]
    Scorer -->|multiple_choice| MC[MC Extractor]
    Scorer -->|open_ended| RB["Rubric + Judge"]
    RB --> AP
    Runner --> Report[Reporting]
    Report --> JSON[JSON Results]
    Report --> MD[Markdown Report]
    Report --> Console[Rich Console]
    Loader --> QF[Question JSON Files]
```
The harness is a Python CLI built with argparse, Pydantic for schema validation, and rich for console output. It follows a clean separation: question loading, provider abstraction, scoring, and reporting are all independent modules.
Provider Abstraction
I started with LiteLLM as a universal provider, routing all models through its OpenAI-compatible interface. This worked for Mercury 2 (which uses an OpenAI-compatible API at Inception's endpoint) but introduced unnecessary abstraction for Anthropic models. I switched Anthropic models to the native SDK for direct Messages API access, better error handling, and accurate truncation detection via stop_reason.
The provider router inspects the model prefix. Models starting with anthropic/ route to the native SDK. Models starting with mercury/ route through LiteLLM with custom API base and key configuration. Adding a new provider means adding an entry to a dictionary.
Question Schema
Each question is a Pydantic model with a shared envelope (ID, provider, service, category, type, difficulty, question text) and type-specific detail fields.
```python
from pydantic import BaseModel

class Question(BaseModel):
    id: str
    provider: str
    service: str
    category: str
    type: QuestionType        # enum: factual / multiple_choice / open_ended
    difficulty: Difficulty    # enum: foundational / associate / professional
    question: str
    source_url: str = ""
    factual: FactualDetails | None = None
    multiple_choice: MultipleChoiceDetails | None = None
    open_ended: OpenEndedDetails | None = None
```
Factual details carry acceptable_answers and legacy_answers lists. Multiple choice details carry a choices dict and correct_answer letter. Open-ended details carry a rubric (list of required knowledge points), reference_answer, and legacy_notes for the judge.
The entire question bank lives in JSON files organized by provider, category, and service: questions/aws/database/aurora.json. Every question validates against the Pydantic schema at load time. The CLI's --validate flag checks all question files without calling any APIs.
Reporting Pipeline
Each benchmark run produces three outputs. A JSON file with complete per-question results (response text, scores, token counts, latency, judge responses). A Markdown report with summary tables and per-question detail grouped by category, showing the model's answer and the correct answer for failures. A rich console summary with color-coded accuracy tables and warnings for truncated responses or legacy answer matches.
The Markdown reports proved more useful than I expected. Scanning through incorrect answers in the report is the fastest way to identify patterns: which categories the model struggles with, whether failures cluster around specific services, and whether the model's wrong answers are close (off by one digit) or completely off base.
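The failure-detail section of that report reduces to a grouped render over incorrect results. A simplified sketch using plain dicts in place of the harness's result objects (the real report also carries token counts and judge responses):

```python
def failure_section(results: list[dict]) -> str:
    """Render per-category failure detail for the Markdown report."""
    failures = [r for r in results if not r["correct"]]
    lines = []
    for category in sorted({r["category"] for r in failures}):
        lines.append(f"## {category}")
        for r in (f for f in failures if f["category"] == category):
            lines.append(
                f"- **{r['id']}**: answered `{r['answer']}`, expected `{r['expected']}`"
            )
    return "\n".join(lines)
```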
Key Takeaways
All four models passed the Cloud Practitioner bar. Every model scored above 89% on foundational questions. For basic AWS knowledge (what services do, default limits, core concepts), current LLMs are reliable sources. I would trust any of these models to answer foundational questions during a design session.
Professional-tier questions separate the models. Sonnet 4.5 scored 100% on professional questions. Sonnet 4 scored 66.7%. If you are using an LLM for advanced architecture decisions involving multi-service patterns, quorum mechanics, or capacity planning, model selection matters significantly.
Open-ended architecture reasoning is the hardest test. Factual recall is largely solved (94%+ across the board). The ability to synthesize facts into structured architecture explanations is where models diverge. Sonnet 4.5 is in a different league here.
Training data staleness is a real operational risk. Every model cited at least one outdated AWS value. The Secrets Manager rotation interval question tripped all four models. If you rely on LLMs for security architecture reviews or capacity planning, cross-check any specific limits or quotas against current AWS documentation. The model's confidence in a stale number is indistinguishable from its confidence in a current one.
Diffusion models can compete on knowledge. Mercury 2's 92.5% overall accuracy and 100% multiple choice score demonstrate that diffusion-based language models retain deep domain knowledge. The speed advantage (5x faster than autoregressive models) makes this architecture compelling for high-volume knowledge retrieval workloads.
The judge model matters as much as the tested model. Open-ended scoring depends on the judge prompt, the judge model, and the rubric design. Small changes in any of these cascade into large score swings. If you build your own evaluation, invest heavily in judge prompt engineering and run the same questions multiple times to measure variance.
Additional Resources
- LLM-as-a-Judge: Complete Guide to Automated Evaluation - Comprehensive guide to using LLMs for evaluation
- LLM Evaluation Benchmarks and Safety Datasets for 2025 - Overview of current evaluation landscape
- Introducing Mercury 2: Inception Labs - Architecture details of the diffusion-based language model
- Mercury: Ultra-Fast Language Models Based on Diffusion - Research paper on diffusion LLM architecture
- Dated Data: Tracing Knowledge Cutoffs in Large Language Models - Research on training data staleness
- DyKnow: Dynamically Verifying Time-Sensitive Factual Knowledge in LLMs - Benchmark for time-sensitive knowledge evaluation
- Amazon Bedrock Model Evaluation: LLM-as-a-Judge - AWS approach to LLM evaluation
- AWS LLM Evaluation Methodology - AWS sample code for evaluation pipelines
- Claude Models Overview - Anthropic model documentation
Let's Build Something!
I help teams ship cloud infrastructure that actually works at scale. Whether you're modernizing a legacy platform, designing a multi-region architecture from scratch, or figuring out how AI fits into your engineering workflow, I've seen your problem before. Let me help.
Currently taking on select consulting engagements through Vantalect.

