A new study just quantified what many suspected but few could prove: every large language model fabricates answers, and more context makes it worse.
The paper — "How Much Do LLMs Hallucinate in Document Q&A Scenarios?" by Roig (March 2026) — tested 35 open-weight models across 172 billion tokens of evaluation. Not a benchmark. Not a leaderboard. A ground-truth-first methodology called RIKER that generates documents from known facts, then asks models about them. No contamination. No LLM-as-judge bias. Just deterministic scoring at massive scale.
Let's unpack what that means:
| Condition | Fabrication Rate | What it means |
|---|---|---|
| Best model, 32K context | 1.19% | 1 in 84 answers fabricated — under perfect conditions nobody uses |
| Top-tier models, typical use | 5–7% | 1 in 14–20 answers fabricated |
| Median of all 35 models | ~25% | 1 in 4 answers fabricated |
| Any model at 200K context | >10% | Every model exceeds 10% — context window = fabrication amplifier |
That last row is the killer. The industry is racing to build longer context windows — 200K, 1M, 2M tokens. This study shows that longer context doesn't mean better answers. It means more fabrication.
The study's most important finding isn't the fabrication rates. It's this:
"Grounding ability and fabrication resistance are distinct capabilities — models that excel at finding facts may still fabricate facts that do not exist."
Read that again. A model can be excellent at retrieving information from documents and simultaneously excellent at inventing information that doesn't exist. These are separate skills. You can't test for one and assume the other.
This breaks the "just give it the documents" argument. RAG doesn't solve hallucination. Context stuffing doesn't solve hallucination. The model will find your real facts AND confidently fabricate fake ones, and you can't tell the difference from the output alone.
If a chatbot hallucinates, a human might catch it. If an autonomous agent hallucinates, it acts on it.
Consider the agentic commerce stack being built on Base and Ethereum right now. Agents are getting wallets (ERC-8004), payment rails (x402), and commerce protocols (ERC-8183). They can negotiate, pay, and settle — autonomously.
Now combine that with 5–7% fabrication rates. An agent processes a contract, fabricates a clause that doesn't exist, and executes payment based on it. No human in the loop. The money moves.
The stack verifies identity, authentication, and payment. Nobody verifies the output.
That's the gap this study quantifies — and exactly why output verification isn't optional for autonomous systems.
The study tested four temperatures (0.0, 0.4, 0.7, 1.0). T=0.0 gives best accuracy only ~60% of the time. And it increases coherence loss (infinite loops) by up to 48× compared to T=1.0. There's no magic temperature setting.
Model family predicts fabrication resistance better than model size. The overall accuracy range spans 72 percentage points across the 35 models. Size isn't the answer — architecture and training are.
Results were consistent across NVIDIA H200, AMD MI300X, and Intel Gaudi3. Hardware doesn't change fabrication behavior.
Running the same model 5 times (like Anthropic's multi-agent code review) doesn't help. If the model has a systematic fabrication pattern — and this study shows they do — you'll get 5 copies of the same fabrication. Common-mode failure.
The fix: cross-model verification.
The study shows that fabrication patterns are model-family-specific: different model families fabricate differently. A fact that GPT-5 fabricates, DeepSeek may correctly refuse to assert; a hallucination from Llama, Qwen may catch.
This is exactly the thesis behind multi-model verification: if fabrication is model-specific, then independent verification across model families is the only scalable defense. Not because any single verifier is perfect — but because their failure modes are uncorrelated.
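The arithmetic behind "uncorrelated failure modes" is worth making explicit. A minimal sketch, using an illustrative per-model fabrication rate (the 6% figure is an assumption in the typical-use range quoted above, not a number computed in the study):

```python
# Why uncorrelated verifiers beat same-model ensembles (illustrative rates).

p_fabricate = 0.06  # assumed per-model fabrication rate, typical-use range

# Same-model ensemble (common-mode failure): a systematic fabrication is
# reproduced by every copy, so 5 runs leave the error rate unchanged.
p_ensemble_miss = p_fabricate

# Three independent model families: a fabrication survives only if all
# three independently produce the same one (assuming uncorrelated failures).
p_independent_miss = p_fabricate ** 3

print(f"same-model ensemble miss rate:   {p_ensemble_miss:.4%}")
print(f"3 independent families miss rate: {p_independent_miss:.4%}")
```

Under the independence assumption, a 6% per-model rate drops to roughly 0.02% for three families. Real model families are not perfectly independent, so this is a best case, but the direction of the effect is the point.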
ThoughtProof's approach: Every claim is verified by 3+ models from different provider families. We measure agreement (MDI — Model Diversity Index) and only attest when independent models converge. If they disagree, that's a signal, not a bug.
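The consensus gate can be sketched as follows. Everything here is illustrative: the `verify_claim` function, the 2/3 threshold, and the diversity measure are assumptions for the sketch, not ThoughtProof's actual API or MDI formula.

```python
from collections import Counter

def verify_claim(answers: dict[str, str], min_agreement: float = 2 / 3) -> dict:
    """Attest a claim only when independent model families converge.

    `answers` maps model family -> extracted answer (illustrative API,
    not ThoughtProof's; real systems would also normalize answers).
    """
    counts = Counter(answers.values())
    top_answer, top_votes = counts.most_common(1)[0]
    agreement = top_votes / len(answers)
    # Illustrative diversity signal: fraction of distinct answers.
    # (Not the actual MDI formula, which the post does not specify.)
    diversity = len(counts) / len(answers)
    attested = agreement >= min_agreement
    return {
        "attested": attested,
        "answer": top_answer if attested else None,  # disagreement = signal
        "agreement": agreement,
        "diversity": diversity,
    }

# Three provider families answer the same document question:
result = verify_claim({
    "openai": "Clause 4.2 caps liability at $1M",
    "anthropic": "Clause 4.2 caps liability at $1M",
    "deepseek": "No liability cap found in the contract",
})
print(result["attested"], round(result["agreement"], 3))
```

Here two of three families converge, so the claim is attested at 0.667 agreement; three mutually disagreeing answers would return `attested: False`, which is the "signal, not a bug" case.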
pot-sdk (MIT) · Verification API · Agentic Commerce
Let's do the math for an enterprise deploying a top-tier model at 128K context:
At the ~15% fabrication rate the study reports for top-tier models at 128K context, roughly 1 in 7 answers contains fabricated content.
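Scaling that rate to production volume makes it concrete. The query volume below is an assumed example workload, not a figure from the study; only the 15% rate comes from the post:

```python
# Assumed workload: 10,000 document-Q&A queries per day (illustrative).
queries_per_day = 10_000
fabrication_rate = 0.15  # ~15% at 128K context, top-tier model (per the study)

fabricated_per_day = queries_per_day * fabrication_rate
fabricated_per_year = fabricated_per_day * 365

print(f"{fabricated_per_day:.0f} fabricated answers/day")
print(f"{fabricated_per_year:,.0f} fabricated answers/year")
```

At that assumed volume, the model produces 1,500 fabricated answers per day, over half a million per year.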
For a compliance tool, a medical system, or an agent executing financial transactions — that's not an acceptable error rate. That's a lawsuit rate.
Why can't the big AI providers solve this? Because of a structural conflict of interest:
| Provider | Approach | Problem |
|---|---|---|
| Anthropic | 5× Claude agents | Same model, same failure modes |
| xAI | Multiple Grok instances | Same model, same failure modes |
| OpenAI | o-series "self-checking" | Same model, same failure modes |
Every provider's incentive is to sell more of their own tokens. Multi-model verification requires using competitors' models. No provider will build that. It has to come from a neutral layer.
172 billion tokens. 35 models. The conclusion is unambiguous:
The question isn't whether your AI will hallucinate. It's whether you'll know when it does.
Paper: Roig, "How Much Do LLMs Hallucinate in Document Q&A Scenarios?" arXiv:2603.08274, March 2026
ThoughtProof is the epistemic verification layer for AI systems. Multi-model consensus with cryptographic receipts.
GitHub · npm · API · Agentic Commerce