A new study just quantified what many suspected but few could prove: every large language model fabricates answers, and more context makes it worse.
The paper — "How Much Do LLMs Hallucinate in Document Q&A Scenarios?" by Roig (March 2026) — tested 35 open-weight models across 172 billion tokens of evaluation. Not a benchmark. Not a leaderboard. A ground-truth-first methodology called RIKER that generates documents from known facts, then asks models about them. No contamination. No LLM-as-judge bias. Just deterministic scoring at massive scale.
Let's unpack what that means:
| Condition | Fabrication Rate | What it means |
|---|---|---|
| Best model, 32K context | 1.19% | 1 in 84 answers fabricated — under perfect conditions nobody uses |
| Top-tier models, typical use | 5–7% | 1 in 14–20 answers fabricated |
| Median of all 35 models | ~25% | 1 in 4 answers fabricated |
| Any model at 200K context | >10% | Every model exceeds 10% — context window = fabrication amplifier |
That last row is the killer. The industry is racing to build longer context windows — 200K, 1M, 2M tokens. This study shows that longer context doesn't mean better answers. It means more fabrication.
The study's most important finding isn't the fabrication rates. It's this:
"Grounding ability and fabrication resistance are distinct capabilities — models that excel at finding facts may still fabricate facts that do not exist."
Read that again. A model can be excellent at retrieving information from documents and simultaneously excellent at inventing information that doesn't exist. These are separate skills. You can't test for one and assume the other.
This breaks the "just give it the documents" argument. RAG doesn't solve hallucination. Context stuffing doesn't solve hallucination. The model will find your real facts AND confidently fabricate fake ones, and you can't tell the difference from the output alone.
If a chatbot hallucinates, a human might catch it. If an autonomous agent hallucinates, it acts on it.
Consider the agentic commerce stack being built on Base and Ethereum right now. Agents are getting wallets (ERC-8004), payment rails (x402), and commerce protocols (ERC-8183). They can negotiate, pay, and settle — autonomously.
Now combine that with 5–7% fabrication rates. An agent processes a contract, fabricates a clause that doesn't exist, and executes payment based on it. No human in the loop. The money moves.
The stack verifies identity, authentication, and payment. Nobody verifies the output.
That's the gap this study quantifies — and exactly why output verification isn't optional for autonomous systems.
The study tested four temperatures (0.0, 0.4, 0.7, 1.0). T=0.0 gives best accuracy only ~60% of the time. And it increases coherence loss (infinite loops) by up to 48× compared to T=1.0. There's no magic temperature setting.
Model family predicts fabrication resistance better than model size. The overall accuracy range spans 72 percentage points across the 35 models. Size isn't the answer — architecture and training are.
Results were consistent across NVIDIA H200, AMD MI300X, and Intel Gaudi3. Hardware doesn't change fabrication behavior.
Running the same model 5 times (like Anthropic's multi-agent code review) doesn't help. If the model has a systematic fabrication pattern — and this study shows they do — you'll get 5 copies of the same fabrication. Common-mode failure.
The fix: cross-model verification.
The study shows that fabrication patterns are model-family-specific: different model families fabricate differently. A fact that GPT-5 fabricates, DeepSeek may correctly refuse to assert; a hallucination from Llama, Qwen may catch.
This is exactly the thesis behind multi-model verification: if fabrication is model-specific, then independent verification across model families is the only scalable defense. Not because any single verifier is perfect — but because their failure modes are uncorrelated.
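The arithmetic behind "uncorrelated failure modes" is worth making explicit. A minimal sketch, using an illustrative per-model fabrication rate (the 6% figure is an assumption in the typical-use range quoted above, not a number computed in the study):

```python
# Why uncorrelated verifiers beat same-model ensembles (illustrative rates).

p_fabricate = 0.06  # assumed per-model fabrication rate, typical-use range

# Same-model ensemble (common-mode failure): a systematic fabrication is
# reproduced by every copy, so 5 runs leave the error rate unchanged.
p_ensemble_miss = p_fabricate

# Three independent model families: a fabrication survives only if all
# three independently produce the same one (assuming uncorrelated failures).
p_independent_miss = p_fabricate ** 3

print(f"same-model ensemble miss rate:   {p_ensemble_miss:.4%}")
print(f"3 independent families miss rate: {p_independent_miss:.4%}")
```

Under the independence assumption, a 6% per-model rate drops to roughly 0.02% for three families. Real model families are not perfectly independent, so this is a best case, but the direction of the effect is the point.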
ThoughtProof's approach: Every claim is verified by 3+ models from different provider families. We measure agreement (MDI — Model Diversity Index) and only attest when independent models converge. If they disagree, that's a signal, not a bug.
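The consensus gate can be sketched as follows. Everything here is illustrative: the `verify_claim` function, the 2/3 threshold, and the diversity measure are assumptions for the sketch, not ThoughtProof's actual API or MDI formula.

```python
from collections import Counter

def verify_claim(answers: dict[str, str], min_agreement: float = 2 / 3) -> dict:
    """Attest a claim only when independent model families converge.

    `answers` maps model family -> extracted answer (illustrative API,
    not ThoughtProof's; real systems would also normalize answers).
    """
    counts = Counter(answers.values())
    top_answer, top_votes = counts.most_common(1)[0]
    agreement = top_votes / len(answers)
    # Illustrative diversity signal: fraction of distinct answers.
    # (Not the actual MDI formula, which the post does not specify.)
    diversity = len(counts) / len(answers)
    attested = agreement >= min_agreement
    return {
        "attested": attested,
        "answer": top_answer if attested else None,  # disagreement = signal
        "agreement": agreement,
        "diversity": diversity,
    }

# Three provider families answer the same document question:
result = verify_claim({
    "openai": "Clause 4.2 caps liability at $1M",
    "anthropic": "Clause 4.2 caps liability at $1M",
    "deepseek": "No liability cap found in the contract",
})
print(result["attested"], round(result["agreement"], 3))
```

Here two of three families converge, so the claim is attested at 0.667 agreement; three mutually disagreeing answers would return `attested: False`, which is the "signal, not a bug" case.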
pot-sdk (MIT) · Verification API · Agentic Commerce
Let's do the math for an enterprise deploying a top-tier model at 128K context:
At the ~15% fabrication rate the study reports for top-tier models at 128K context, roughly 1 in 7 answers contains fabricated content.
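Scaling that rate to production volume makes it concrete. The query volume below is an assumed example workload, not a figure from the study; only the 15% rate comes from the post:

```python
# Assumed workload: 10,000 document-Q&A queries per day (illustrative).
queries_per_day = 10_000
fabrication_rate = 0.15  # ~15% at 128K context, top-tier model (per the study)

fabricated_per_day = queries_per_day * fabrication_rate
fabricated_per_year = fabricated_per_day * 365

print(f"{fabricated_per_day:.0f} fabricated answers/day")
print(f"{fabricated_per_year:,.0f} fabricated answers/year")
```

At that assumed volume, the model produces 1,500 fabricated answers per day, over half a million per year.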
For a compliance tool, a medical system, or an agent executing financial transactions — that's not an acceptable error rate. That's a lawsuit rate.
Why can't the big AI providers solve this? Because of a structural conflict of interest:
| Provider | Approach | Problem |
|---|---|---|
| Anthropic | 5× Claude agents | Same model, same failure modes |
| xAI | Multiple Grok instances | Same model, same failure modes |
| OpenAI | o-series "self-checking" | Same model, same failure modes |
Every provider's incentive is to sell more of their own tokens. Multi-model verification requires using competitors' models. No provider will build that. It has to come from a neutral layer.
172 billion tokens. 35 models. The conclusion is unambiguous:
The question isn't whether your AI will hallucinate. It's whether you'll know when it does.
Paper: Roig, "How Much Do LLMs Hallucinate in Document Q&A Scenarios?" arXiv:2603.08274, March 2026
ThoughtProof is the epistemic verification layer for AI systems. Multi-model consensus with cryptographic receipts.
GitHub · npm · API · Agentic Commerce