Research

Empirical findings about agent readability

Measured results from running real AI agents against real websites at different a14y scores.

v0.2.0 2026-05-16

Raising an a14y score from 37 to 89 roughly halved an AI agent's token use

Same docs, same answer quality, with statistically indistinguishable judge scores. The only change was shipping the top fixes from an a14y audit, which raised the site's a14y score from 37 to 89. That roughly halved Claude's token use and tool calls on the same evaluation task.

Read the case study →