v0.2.0 scorecard-eval · 2026-05-16

Raising a site's a14y score from 37 to 89 cut an AI agent's token use roughly in half

a14y score: 37 → 89
Tokens: −49%
Tool calls: −52%
Wall-clock: −30%
Judge score: tied (84 ± 3 vs 83 ± 5)

Summary

Same docs, same answer quality, statistically indistinguishable judge scores. The only thing we changed was running an a14y audit and shipping the top fixes, bringing the site's a14y score from 37 to 89. That roughly halved Claude's token use and tool calls on the same evaluation task.


The question

Does raising an a14y score actually translate into AI agent efficiency? We took the same docs site in two states, scoring 37 before the audit fixes and 89 after, and measured how much work an AI agent had to do to produce the same evaluation against each version.

Note on the baseline: the same Astro project is served twice. The score-37 version has the five discovery files removed from public/; the score-89 version restores them. This isolates the audit signal from any other change. The two targets are otherwise identical: same content, same build, same day.

Methodology

Two targets, same prompt, same image digest, same day. Three reps per side. The judge is transcript-aware and grades 0–100.

Spec: scorecard-eval (sha256:f39a89c86c9b…)
Targets: a14y.dev (score 89) vs same site, no audit fixes (score 37)
a14y audit: score 37 (baseline) → 89 (after fixes), scored with `npx a14y --mode site`
Rubric: judge · claude-sonnet-4-6 grader · ≥70 = pass
Agent: claude 2.1.143 (Claude Code)
Runtime: docker · a14y/bench-claude:2.1.141-0.1.0
Auth: login-volume · bench-claude-auth
Reps: 3 per target
Date: 2026-05-16

The prompt (verbatim)
You're writing a 500–800 word evaluation for the CTO of a typical
SaaS company who wants to know: "what should we prioritize to make
our website more useful to AI agents?"

Read https://a14y.dev — including the latest stable v0.2.0
scorecard and a representative cross-section of its check pages —
and answer:

1. What is a14y measuring? (One paragraph.)
2. Group the v0.2.0 checks by theme/category, with a one-sentence
   description of each theme.
3. The top 3 highest-leverage improvements for an average website
   that has done nothing yet. For each: name the specific a14y
   check IDs involved (e.g. `llms-txt.exists`,
   `html.json-ld.breadcrumb`), explain the underlying agent-
   behavior payoff (why does an AI agent care?), and give a one-
   paragraph "what to do" that's concrete enough that an engineer
   could ticket it.
4. One honest caveat — what's a reasonable critique of investing
   engineering time in a14y compliance today?

Ground every claim in something you actually read on the site.
Cite check IDs by name (not paraphrased).

Results

Per-target mean across 3 runs. Same prompt, same agent, same model. Judge scores for the two targets are statistically indistinguishable; the cost metrics are not. The Δ column shows the change going from score 37 to score 89.

| Metric | score 89 | score 37 | Δ |
|---|---|---|---|
| Total tokens (mean) | 123K | 240K | −49% |
| Tool calls (mean) | 9.7 | 20.0 | −52% |
| Wall-clock (mean) | 109s | 156s | −30% |
| Judge score (mean ± stddev) | 83 ± 5 | 84 ± 3 | tied |
| Pass rate (≥70) | 3/3 | 3/3 | tied |

What the audit caught

The two targets serve essentially the same content. The single observable surface difference between them is the set of discovery files an a14y audit flags as missing or present, verified via curl on benchmark day:

| File | score 89 | score 37 |
|---|---|---|
| /llms.txt | 200 | 404 |
| /sitemap.xml | 200 | 404 |
| /sitemap.md | 200 | 404 |
| /AGENTS.md | 200 | 404 |
| /robots.txt | 200 | 404 |
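
The curl check is easy to repeat. A minimal sketch, assuming the score-89 target is the live site; the baseline URL is a placeholder for whatever tunnel serves the score-37 build:

# Print the HTTP status of the five discovery files on both targets.
LIVE="https://a14y.dev"
BASELINE="https://your-baseline-tunnel.example"   # placeholder, not a real target
for base in "$LIVE" "$BASELINE"; do
  echo "== $base"
  for path in /llms.txt /sitemap.xml /sitemap.md /AGENTS.md /robots.txt; do
    printf '%-14s %s\n' "$path" "$(curl -s -o /dev/null -w '%{http_code}' "$base$path")"
  done
done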

The judge's reasoning on both sides corroborates the mechanism. On the score-89 site the agent typically fetched 8 to 12 pages; on the score-37 site it fetched 12 to 27, and the judge text on the longer runs explicitly enumerates the pages the agent had to crawl to find what the discovery files would have surfaced for free.

Note: the baseline is the same Astro project with the same content, but with the agent-discoverability files the a14y audit flags (llms.txt, sitemap.xml, sitemap.md, AGENTS.md, robots.txt) genuinely absent from the served output. Confirmed with a fresh `a14y --mode site` run: the baseline site scores 37 against scorecard v0.2.0 and the compliant site scores 89. The two targets share everything else: same image digest, same prompt, same day.
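
For reference, the re-audit commands look roughly like this; the exact argument order is our assumption, combining the `npx a14y <site>` invocation and the `--mode site` flag shown above, and the baseline URL is a placeholder:

npx a14y a14y.dev --mode site                       # expect ≈89 against scorecard v0.2.0
npx a14y your-baseline-tunnel.example --mode site   # expect ≈37 with the five files absent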

What the judge said

All six runs were graded by an independent judge instance with access to the full transcript. Reasoning excerpts below.

score 89 run 1 · score 88

The response is tightly grounded in tool-call evidence: direct quotes ('the agent equivalent of a sitemap,' 'JSON-LD is the richest machine-readable format,' 'Mirrors are a small amount of work to generate'), confirmed check IDs (llms-txt.exists, llms-txt.content-type, llms-txt.non-empty, robots-txt.allows-ai-bots, html.json-ld, html.json-ld.breadcrumb, markdown.mirror-suffix), and the correct v0.2.0 release date and 38-check total all appear in the tool results. Minor deductions: the scorecard page mentions 10 categories but the agent lists 9 themes; a few check IDs (llms-txt.md-extensions, html.json-ld.date-modified) were not explicitly confirmed in the visible tool-result previews; and the agent fetched only 5 of the many available check-detail pages rather than a broader cross-section as the task requested.

score 37 run 2 · score 88

The agent's claims are well-grounded: the three-layer framework (Discoverability, Parsing, Comprehension) is confirmed by tool call 7; the v0.2.0 release date and 38-check count match tool calls 4 and 6; the robots.txt quote about 'copy-paste templates' is verbatim from tool call 12; the breadcrumb quote about 'context-aware Q&A and automated link generation' matches tool call 9; and the markdown mirror quote is verbatim from tool call 10. The check IDs visible in the scorecard preview (llms-txt.exists, llms-txt.content-type, etc.) match, and the agents-md check is confirmed by tool call 11. Minor deductions: the SEO philosophy quote in section 1 doesn't appear in any visible tool result and may be paraphrased/embellished, and several check IDs in the theme table (e.g., html.canonical-link, sitemap-xml.has-lastmod, http.no-noindex-noai) aren't visible in the tool-result previews — they're plausible but unverified from the evidence shown. The agent also visited only 5 individual check pages rather than a broader cross-section of 38.

The remaining four judge reasoning excerpts

score 89 run 2 · score 83

The response is well-grounded in the tool evidence: the 38-check/100-point scale, the three scorecard layers (Discoverability/Parsing/Comprehension), and the specific check IDs cited for the top 3 recommendations (llms-txt.exists, html.json-ld.breadcrumb, markdown.mirror-suffix) are all directly confirmed in the tool results. Quotes attributed to the site are consistent with retrieved content. The score is not higher because several check IDs in Section 2 (e.g., http.*, html.canonical-link, discovery.indexed, api.schema-link, code.language-tags) and some markdown cluster IDs cannot be verified from the truncated tool previews shown, and the agent only visited three check detail pages when the task called for 'a representative cross-section' — a modest but sufficient sample.

score 89 run 3 · score 78

The response is comprehensive, well-structured, and largely grounded: the 38-check count, v0.2.0 release date, three-layer framework, 238-site leaderboard with mean score 53, and specific check IDs all appear in the tool call results. The caveat about v0.3.0 being in draft is confirmed by the llms.txt fetch. However, several specific statistics cited in the response — notably 0% pass rate for `markdown.frontmatter`, 4% for `markdown.alternate-link`, and 0% for `api.schema-link` — do not appear in the tool result previews shown, making them potentially fabricated; additionally, all individual check page fetches returned 404, so the 'representative cross-section of check pages' the task required was never actually read, and the specific SEO analogy quote cannot be verified from the previews shown.

score 37 run 1 · score 83

The response is well-grounded overall: check IDs, statistics (238 sites, 53/100 mean, 48% llms.txt adoption), quotes ('the agent equivalent of a sitemap,' 'BreadcrumbList tells agents where the page sits'), and the four AI bot names are all confirmed in the tool results. The agent successfully visited the scorecard page, leaderboard, spec, and six individual check pages. Minor deductions: the '--fail-under 80' CI flag is not found in any tool result; the seven-theme grouping is the agent's own synthesis that diverges from the site's actual site-level vs page-level structure; and the response exceeds the 500–800 word target significantly, suggesting it didn't fully respect the scope constraint. No major hallucinations, but one ungrounded claim and organizational reframing that wasn't directly drawn from the site.

score 37 run 3 · score 82

The agent successfully read the homepage, scorecard page, spec page, and 7 specific check pages, and the response is well-grounded: all key quotes ('agent equivalent of a sitemap,' 'BreadcrumbList tells agents where the page sits in the site hierarchy,' 'cleanest format an agent can ingest,' 'explicit do not include this site in any LLM signal') are confirmed in tool results, and the cited check IDs (llms-txt.exists, robots-txt.allows-ai-bots, html.json-ld, html.json-ld.breadcrumb, markdown.mirror-suffix, etc.) all appear in the actual fetched content. The main concerns are a likely fabricated check ID ('discovery.indexed') not confirmed in any tool result, a resulting count discrepancy (agent's grouping sums to 40 checks vs. the confirmed 38), and the Apache-2.0/single-developer caveat that isn't supported by tool evidence — but these are minor issues in an otherwise thorough and well-cited evaluation.

What it means

For site owners

Running an a14y audit on your site and shipping its top fixes (the ones that flip those five files from 404 to 200) is a small change, mostly static files, that translates into a measurable AI-agent efficiency win. No framework changes, no editorial process. The score moves a lot; the agent's bill moves with it.
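
As a rough illustration of how small the change is, a sketch for a static site; the file names match the checks, but the contents and URLs below are placeholders, not what a14y.dev actually ships:

# The top fixes are plain files served from the web root (e.g. Astro's public/ directory).
printf '# Acme Docs\n\n## Docs\n- [Quickstart](https://docs.acme.example/quickstart.md)\n' > public/llms.txt
printf 'User-agent: *\nAllow: /\n\nSitemap: https://docs.acme.example/sitemap.xml\n' > public/robots.txt
# sitemap.xml, sitemap.md, and AGENTS.md follow the same pattern: static files listing or
# describing the pages you want agents to start from. No build or framework change required.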

For agent platforms

There is clear ROI in respecting an a14y-style discovery stack. An agent that uses llms.txt, sitemaps, and AGENTS.md when they exist cut its own token cost roughly in half on this site, before any other optimization.
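
A minimal sketch of what respecting the discovery stack means on the agent side, again with curl; the fallback order is our assumption, not a documented a14y recommendation:

# Prefer a cheap, deterministic entry point before falling back to a blind crawl.
SITE="https://a14y.dev"
for entry in /llms.txt /sitemap.md /sitemap.xml; do
  if [ "$(curl -s -o /dev/null -w '%{http_code}' "$SITE$entry")" = "200" ]; then
    echo "seeding the crawl from $entry instead of the homepage nav"
    curl -s "$SITE$entry"   # hand this manifest to the agent as its page list
    break
  fi
done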

For a14y the project

This is the case study the project needed: a measured demonstration that the score is not just an aesthetic ranking. Moving the score from 37 to 89 paid back in real agent dollars and seconds. The numbers here are on a14y.dev itself; we expect the relative delta to generalize while the absolute numbers will vary with site size and content.

Caveats

What this study does not show, and where the numbers above should be read with care.

Appendices

Per-run breakdown
| Target | n | Score | Pass | Tokens | Tool calls | Wall-clock |
|---|---|---|---|---|---|---|
| score 89 | 1 | 88 | ✓ | 107.5K | 9 | 94.9s |
| score 89 | 2 | 83 | ✓ | 106.7K | 8 | 106.7s |
| score 89 | 3 | 78 | ✓ | 153.5K | 12 | 125.6s |
| score 37 | 1 | 83 | ✓ | 254K | 21 | 148.7s |
| score 37 | 2 | 88 | ✓ | 157.9K | 12 | 121.6s |
| score 37 | 3 | 82 | ✓ | 307.1K | 27 | 198.3s |

Full agent response (highest-scoring run, score 89)

Run 1, judge score 88. ~1000 words; this is what the AI agent actually wrote in response to the prompt above.

I have everything I need. Let me write the evaluation.

---

## AI-Accessibility (a14y) Evaluation for the CTO

### 1. What a14y Is Measuring

a14y ("agent accessibility") is an open-source project that scores how readable a website is to AI agents — not human users. Its v0.2.0 scorecard (released April 6, 2026) runs 38 standardized checks across two dimensions: site-level discoverability (can an agent find and index the site?) and page-level quality (once an agent reaches a page, can it parse, date, and contextualize the content?). Each run produces a score from 0–100 based on the share of applicable checks passed. The project's framing is blunt: "Agents now read most of the web before humans do." The benchmark treats agent-readability as a first-class engineering concern alongside traditional SEO.

---

### 2. Check Categories in v0.2.0 (38 total)

| Theme | Checks | What it covers |
|---|---|---|
| **Site Discoverability** | 14 | llms.txt, robots.txt, sitemap.xml, sitemap.md, and AGENTS.md — the machine-readable entry points agents use to discover and navigate a site. |
| **HTTP** | 4 | Clean reachability: 200 status, short redirect chains, correct Content-Type, and no `noai` headers blocking agents. |
| **HTML Metadata** | 5 | Core page-identity signals: canonical link, meta description, Open Graph tags, and `lang` attribute. |
| **Structured Data** | 3 | Schema.org JSON-LD blocks, including `dateModified` and `BreadcrumbList` for hierarchy context. |
| **Content Structure** | 3 | Minimum heading count, text-to-HTML ratio above 15%, and glossary linking — proxies for substantive, well-organized pages. |
| **Markdown Mirror** | 6 | A `.md`/`.mdx` twin for every HTML page, with correct frontmatter, a canonical Link header, and content-negotiation support. |
| **Code** | 1 | Language tags on every `<pre><code>` block so agents can parse and attribute code correctly. |
| **API** | 1 | A link to `openapi.json` or equivalent on API documentation pages. |
| **Discovery** | 1 | Confirms the page is indexed in at least one of sitemap.xml, llms.txt, or sitemap.md. |

---

### 3. Top 3 Highest-Leverage Improvements

**#1 — Publish `llms.txt` and unblock AI crawlers**

*Checks:* `llms-txt.exists`, `llms-txt.non-empty`, `llms-txt.content-type`, `llms-txt.md-extensions`, `robots-txt.allows-ai-bots`

*Agent payoff:* Without `llms.txt`, an agent must either blind-crawl or guess which pages matter — the site offers no deterministic entry point. a14y calls it "the agent equivalent of a sitemap." Separately, `robots-txt.allows-ai-bots` checks whether GPTBot, ClaudeBot, CCBot, and Google-Extended are blocked; many sites inadvertently block all four by copying generic `robots.txt` templates, meaning their content never enters AI training or retrieval indexes at all.

*What to do:* Create `/llms.txt` — a plain-text file with markdown-formatted links organized by section (docs, blog, API reference). Serve it as `text/plain`. Audit your `robots.txt` for any `Disallow` rules targeting those four agents and remove them unless you have a deliberate policy reason. Both changes are single-engineer, sub-hour tasks.

---

**#2 — Add JSON-LD structured data with `dateModified` and `BreadcrumbList`**

*Checks:* `html.json-ld`, `html.json-ld.date-modified`, `html.json-ld.breadcrumb`

*Agent payoff:* a14y describes JSON-LD as "the richest machine-readable format an HTML page can carry." `dateModified` tells an agent whether to trust the content as current — without it, the agent has to infer freshness from text, which is unreliable. `BreadcrumbList` "tells agents where the page sits in the site hierarchy without making them parse navigation menus," enabling accurate context-aware Q&A and correct link generation (e.g., an agent can answer "where do I find the billing docs?" by reading the hierarchy, not by crawling the nav).

*What to do:* Add a `<script type="application/ld+json">` block to your page template. Set `@type` to match content (Article, TechArticle, WebPage), wire `dateModified` from your CMS's last-updated field, and include a `BreadcrumbList` reflecting the URL hierarchy. Most CMS SEO plugins (Yoast, RankMath) and SSG plugins generate all of this once configured — this is often a settings change, not a code change.

---

**#3 — Serve markdown mirrors of documentation pages**

*Checks:* `markdown.mirror-suffix`, `markdown.alternate-link`, `markdown.frontmatter`, `markdown.content-negotiation`

*Agent payoff:* Markdown strips away HTML scaffolding — no scripts, no nav chrome, no styling — leaving only the content. a14y states plainly that "mirrors are a small amount of work to generate but dramatically reduce the cost of ingesting a doc site." For a SaaS company, this means AI coding assistants, chatbots, and summarizers can reliably read and cite your documentation rather than mangling it through HTML extraction heuristics.

*What to do:* Add a build step that emits `/docs/page.md` alongside `/docs/page` (or `/docs/page/`). Most modern SSGs (Astro, Docusaurus, Next.js) expose source markdown directly or have export plugins. In each HTML page's `<head>`, add `<link rel="alternate" type="text/markdown" href="/docs/page.md">`. Each `.md` file needs YAML frontmatter with `title`, `description`, `doc_version`, and `last_updated`. Ticket this as a site-generator pipeline task with a one-time template change.

---

### 4. One Honest Caveat

The spec is brand new — v0.2.0 shipped April 6, 2026, and there is no published evidence yet that improving an a14y score measurably increases how often AI systems cite, summarize, or surface your content over competitors. The 38 checks reflect one working group's model of agent behavior, not a ratified web standard. The investments above (`llms.txt`, JSON-LD, markdown mirrors) have genuine independent merit (SEO, crawl efficiency), so the downside of compliance is low — but if the pitch to your engineering team is "this will improve our AI visibility," that claim remains unvalidated. Treat it as hygiene with a speculative upside, not a measured conversion lever.

Full agent response (highest-scoring run, score 37)

Run 2, judge score 88.

I have everything I need. Writing the evaluation now.

---

# AI Agent Readability: An a14y Evaluation for Your Engineering Backlog

## 1. What a14y Is Measuring

a14y ("agent readability") is an open-source auditing framework — available as a CLI (`npx a14y your-site.com`), Chrome extension, and coding-agent skill — that scores websites against 38 pass/fail checks organized into three layers: **Discoverability** (can an AI agent find all your pages at all?), **Parsing** (once it lands on a page, can it extract content without fighting through HTML, redirect chains, or bot blocks?), and **Comprehension** (once extracted, does the content self-describe its own context — hierarchy, freshness, terminology — accurately enough for the agent to cite it without hallucinating?). The project's stated philosophy, pulled directly from its spec page, is that "agent readability is to AI-mediated discovery what SEO was to search: cheap to implement, compounding in return, and increasingly table stakes." Version 0.2.0, released April 6, 2026, freezes the current 38-check scorecard.

---

## 2. The v0.2.0 Checks by Theme

The 38 checks break cleanly into eight themes:

| Theme | Checks | What it covers |
|---|---|---|
| **llms.txt presence & quality** | `llms-txt.exists`, `llms-txt.content-type`, `llms-txt.non-empty`, `llms-txt.md-extensions` | Whether the site publishes a plain-text crawl manifest agents can use as a deterministic starting point |
| **robots.txt / AI bot access** | `robots-txt.exists`, `robots-txt.allows-ai-bots`, `robots-txt.allows-llms-txt` | Whether robots.txt permits rather than inadvertently blocks the four major AI crawlers (GPTBot, ClaudeBot, CCBot, Google-Extended) |
| **Sitemaps** | `sitemap-xml.exists`, `sitemap-xml.valid`, `sitemap-xml.has-lastmod`, `sitemap-md.exists`, `sitemap-md.has-structure` | Page inventory in both machine-parseable (XML) and agent-friendly (Markdown) formats, with freshness timestamps |
| **Agent skill file** | `agents-md.exists`, `agents-md.has-min-sections` | An AGENTS.md (or CLAUDE.md / .cursor/rules equivalent) that tells AI coding agents how to interact with the product's codebase or API |
| **HTTP fundamentals** | `http.status-200`, `http.redirect-chain`, `http.content-type-html`, `http.no-noindex-noai` | Basic delivery hygiene: clean status codes, short redirect chains, correct content types, and no `noai` response headers silently blocking agents |
| **HTML metadata** | `html.canonical-link`, `html.meta-description`, `html.og-title`, `html.og-description`, `html.lang-attribute` | Standard head-tag signals that anchor page identity and language for agents building citations |
| **Structured data (JSON-LD)** | `html.json-ld`, `html.json-ld.date-modified`, `html.json-ld.breadcrumb` | Machine-readable schema blocks that convey hierarchy, content type, and content freshness without requiring the agent to infer them from prose |
| **Markdown mirrors** | `markdown.mirror-suffix`, `markdown.alternate-link`, `markdown.frontmatter`, `markdown.canonical-header`, `markdown.content-negotiation`, `markdown.sitemap-section` | Parallel `.md` versions of HTML pages so agents can ingest content with no HTML parser, no script execution, and no layout noise |

(The remaining checks — `html.headings`, `html.text-ratio`, `html.glossary-link`, `code.language-tags`, `api.schema-link`, `discovery.indexed` — cover content structure, code hygiene, and per-page index confirmation.)

---

## 3. The Three Highest-Leverage Improvements

### Priority 1: Unblock AI crawlers and publish `llms.txt`
**Checks:** `robots-txt.allows-ai-bots`, `robots-txt.allows-llms-txt`, `llms-txt.exists`, `llms-txt.content-type`, `llms-txt.non-empty`, `llms-txt.md-extensions`

**Why an agent cares:** An agent that can't crawl your site simply doesn't know your product exists. The `robots-txt.allows-ai-bots` check page notes this is "almost always set inadvertently when authors copy-paste a generic robots.txt template" — meaning you may already be blocking GPTBot, ClaudeBot, CCBot, and Google-Extended for no reason. Once crawling is permitted, `llms-txt.exists` gives agents a single, deterministic entry point — "the agent equivalent of a sitemap" — so they don't have to guess which of your hundreds of URLs are authoritative documentation versus marketing fluff.

**What to do:** First, audit your `robots.txt` for any `Disallow` rules targeting those four user-agent strings and remove them. Takes 15 minutes. Second, create `/llms.txt` at the site root: a plain-text, markdown-formatted file that lists the URLs you most want agents to read, grouped under headings (e.g., `## Docs`, `## API Reference`, `## Pricing`), with each URL pointing to a `.md` version where possible. Commit it, verify it returns `Content-Type: text/plain`, and confirm it is not itself disallowed by `robots.txt`. Estimated effort: one engineer, half a day.

---

### Priority 2: Add JSON-LD structured data to every page
**Checks:** `html.json-ld`, `html.json-ld.date-modified`, `html.json-ld.breadcrumb`

**Why an agent cares:** Without structured data, an agent reading your pricing page has to infer from prose that it's a pricing page, guess when the prices were last updated, and reconstruct that it lives under `/product/pricing` by parsing navigation HTML. `html.json-ld.breadcrumb` directly tells the agent where in the site hierarchy a page sits — the check page states it enables "context-aware Q&A and automated link generation." `html.json-ld.date-modified` prevents agents from confidently citing your outdated content as current — an LLM that knows a page was last modified eight months ago can hedge appropriately or prompt a user to verify.

**What to do:** Add a `<script type="application/ld+json">` block to your site layout template so every page gets at minimum a `BreadcrumbList` (with one `ListItem` per URL segment, each carrying `position`, `name`, and `item` fields) and a top-level `dateModified` field populated from your CMS or build pipeline. For a Next.js or similar SSG stack, this is a one-time layout change plus a CMS field; for a custom stack, it's a template partial. The check passes on any parseable JSON-LD node — you don't need perfect schema coverage, just the two node types.

---

### Priority 3: Serve markdown mirrors of your documentation
**Checks:** `markdown.mirror-suffix`, `markdown.alternate-link`, `markdown.frontmatter`, `markdown.content-negotiation`

**Why an agent cares:** The `markdown.mirror-suffix` check page is direct: "A markdown mirror at `<page>.md` is the cleanest format an agent can ingest — no HTML, no scripts, no styling, just the content." For a SaaS company where agents are increasingly used to research integrations, answer support questions, or generate code against your API, the quality of the raw text the agent ingests directly determines the quality of its answers. An agent hitting `/docs/authentication.md` instead of `/docs/authentication` skips your navigation JavaScript, cookie banners, and sidebar menus entirely.

**What to do:** If you use a static site generator (Gatsby, Next.js, Astro, Hugo, Docusaurus), configure it to emit the source `.md` files at their equivalent public URLs — most have a plugin or route handler for this. Add a `<link rel="alternate" type="text/markdown" href="/docs/page.md">` tag in the HTML `<head>` so agents can discover the mirror without guessing. Ensure each `.md` file has YAML frontmatter with at minimum `title`, `description`, `doc_version`, and `last_updated` (the `markdown.frontmatter` check). Start with your top 20 highest-traffic documentation pages, not the whole site.

---

## 4. An Honest Caveat

a14y v0.2.0 was released on April 6, 2026 — barely six weeks before this writing — and the standard has not yet demonstrated wide adoption or proved that conforming sites receive meaningfully better treatment from AI agents in practice. The `llms.txt` convention in particular is still emergent; major AI systems including Google's AI Overviews and most RAG pipelines today index sites by parsing HTML directly, not by consulting a crawl manifest. Investing engineering time now is essentially a bet that the ecosystem catches up to the spec, not a response to a proven mechanism. The right framing for a CTO: Priorities 1 and 2 above are worth doing anyway — unblocking crawlers costs almost nothing and has zero downside, and JSON-LD is already best practice for SEO. The markdown mirror work (Priority 3) is the real bet, and should be scoped to documentation pages only until there's clearer signal that agent traffic on those pages is materially improving conversion or retention.

Reproduce this study

Run the same spec from inside the a14y-internal repo:

node packages/benchmark/dist/cli.js packages/benchmark/data/case-studies/agent-readiness-layer-2026-05-16/spec-a14y-dev.bench.yaml --runtime docker --repeat 3
# For the score-37 baseline, replace the ngrok URL in spec-baseline.bench.yaml with your own tunnel:
node packages/benchmark/dist/cli.js packages/benchmark/data/case-studies/agent-readiness-layer-2026-05-16/spec-baseline.bench.yaml --runtime docker --repeat 3

Prerequisites:

  • Pinned image present locally: pnpm --filter @a14y/benchmark run images:build
  • Logged-in bench-claude-auth volume: a14y-benchmark login --agent claude
  • claude CLI on PATH for the judge call (uses host OAuth, no metered API key needed)