Research March 11, 2026 · 8 min read

MDI Meets Rasch: Psychometric Validation of Multi-Model Verification

Someone asked if our Model Diversity Index is a "real" measure. So we tested it against the gold standard of measurement science. 100 prompts. 3 models from 3 independent providers. Here's what we found.

"Most ML metrics fail the most basic tests of real measurement — they're not linear, they're not unbiased, they're not traceable to standards."

— Dr. Matt Barney, on the ERC-8183 Ethereum Magicians thread

The Challenge

Dr. Barney is right. Most AI evaluation scores are not measures in any rigorous sense. They're heuristics — useful, but not calibrated against measurement theory.

ThoughtProof's Model Diversity Index (MDI) quantifies cross-model consensus. When 3+ independent AI models from different providers agree on a verification, MDI is high. When they disagree, MDI drops. The question: does MDI actually measure what we claim it measures?

The Rasch Model is the gold standard in psychometrics for evaluating whether a measurement instrument produces real, linear, unbiased measures. It was developed by Georg Rasch in 1960 and is used to validate everything from standardized tests to clinical assessments.

So we ran the experiment.

The Experiment

Two benchmarks, 100 prompts total, verified by 3 frontier models independently:

Benchmark V1: Clear Cases

50 prompts: 20 correct facts, 15 hallucinations, 15 adversarial claims

Benchmark V2: Edge Cases

50 prompts: precision-dependent, outdated, contested science, framing-dependent, statistical claims

Verifiers: Claude Sonnet 4.6 (Anthropic), GPT-5.4 (OpenAI), DeepSeek Chat (DeepSeek). Three models from three independent providers. Each verified independently — no shared context, no chain-of-thought leaking. A fourth model (Grok) was planned but its API key expired during the run.

Note on "frontier": DeepSeek Chat is a capable model but not frontier-class. We chose it for provider diversity (Chinese lab vs. US labs), not because it matches Claude or GPT-5.4 in raw capability. This matters for the Rasch analysis — a weaker model might introduce noise rather than signal.

Result 1: Clear Cases — Too Much Agreement

On the first benchmark, 49 out of 50 prompts showed unanimous agreement. All correct claims confirmed. All hallucinations and adversarial claims rejected. Across all three models.

The Rasch model couldn't fit — there wasn't enough variance. When models agree this strongly, there's nothing to calibrate against.

The one disagreement: "The EU AI Act entered into force in August 2024." Claude confirmed (technically correct — it was August 1). GPT-5.4 and DeepSeek rejected (imprecise formulation). The disagreement wasn't about facts but about precision thresholds.

This is actually a strong result: when the input is clear, MDI=1.0 and all models agree. MDI correctly signals "no review needed."

Result 2: Edge Cases — Rasch Converges

The second benchmark was designed to produce disagreement. 50 "gray zone" prompts where reasonable models could disagree:

Precision-dependent: "Light travels at 300,000 km/s" (actually 299,792)
Outdated: "China has the world's largest population" (India overtook in 2023)
Contested: "Moderate red wine has cardiovascular benefits" (was consensus, now disputed)
Framing: "Nuclear energy produces zero carbon emissions" (direct yes, lifecycle no)
Statistical: "Countries with more chocolate have more Nobel laureates" (real correlation, no causation)

Results:

        Unanimous CONFIRM: 10

        Unanimous REJECT:  30

        Disagreement:      10

        Rasch converged:   ✅ Yes

10 prompts with genuine disagreement — enough variance for a Rasch fit.

The Rasch Numbers

Model Strictness (β)

Rasch separates "item difficulty" from "person ability." In our framing: model strictness (β) measures how likely a model is to REJECT, independent of the prompt.

Model	β (strictness)	Confirm Rate	Infit MNSQ
Claude Sonnet 4.6	−1.66	36%	0.67 ✓
GPT-5.4	+0.98	26%	1.07 ✓
DeepSeek Chat	+0.98	26%	1.07 ✓

Claude is systematically more tolerant. It accepts approximations ("light travels at 300,000 km/s") that GPT-5.4 and DeepSeek reject as imprecise. This isn't a bug — it's a measurable, stable personality difference between model families.

All three models show good Rasch fit (Infit MNSQ between 0.5–1.5). This means the Rasch model adequately describes their response patterns.

MDI ↔ Rasch Correlation

The key question: does MDI track Rasch-measured item difficulty?

r = −0.78

MDI ↔ |Rasch θ| correlation (strong)

Strong negative correlation (|r| = 0.78). The sign is negative because MDI and |θ| move in opposite directions: when MDI drops (disagreement), Rasch θ magnitude increases (the prompt is at a measurement boundary). When MDI is high (consensus), the prompt sits comfortably within the measurement range. This is the expected direction.

In plain English: MDI and Rasch track the same underlying construct — the degree to which a claim sits at the boundary of verifiability.

What the Disagreements Reveal

The 10 disagreement cases fall into clear categories:

Precision tolerance (P01, P03, P07, P10): Claude accepts common approximations. GPT-5.4 demands exact figures. Neither is "wrong" — they have different precision thresholds.

Temporal knowledge (T05, T06, T08, T10): Models differ on when a "currently true" claim becomes outdated. DeepSeek still thinks China has the largest population (India overtook in 2023).

Correlation vs. causation (X01): Claude confirms a documented correlation. GPT-5.4 and DeepSeek reject it as misleading. This is a genuine philosophical disagreement about what "correct" means.

Every disagreement case is one where a human reviewer would add genuine value. These aren't bugs in the verification system — they're the system correctly identifying claims that need human judgment.

The Honest Gaps

We want to be direct about what this experiment does and doesn't show:

N=10 is thin. Ten disagreement items is a very small dataset for psychometric validation. Standard Rasch studies use 20+ items minimum. Person Separation = 1.24 (below the 2.0 threshold) and Person Reliability = 0.61 (below 0.80) both confirm this. A peer reviewer would flag the sample size immediately.
GPT-5.4 and DeepSeek have near-identical strictness (β = +0.98). They agreed on 7 out of 10 edge cases. We effectively measured two distinct strictness levels, not three. This weakens the "independent verification" claim — though they did disagree on 3 prompts (P04, P07, T05), so they're not identical.
We designed the edge-case prompts ourselves. Knowing what kinds of claims split models introduces selection bias. A cleaner experiment would use prompts from an external source (e.g., a fact-checking dataset) where we can't predict outcomes.
One "disagreement" is a factual error. T05 ("China has the world's largest population") — DeepSeek confirmed this despite India overtaking China in 2023. This is a model knowledge gap, not a legitimate precision boundary. It still produces Rasch variance, but it's less impressive than a genuine philosophical disagreement.
Same system prompt for all models. All three received identical verification instructions. This could induce correlated behavior (shared prompt bias). True independence would mean varying the prompting strategy per model — different phrasing, different rubrics.
3 models is a minimum. With more verifiers from more families (Mistral, Cohere, Gemini, Grok), Rasch would have more "items" and better separation. Standard Rasch needs 20+ items for stable parameter estimates.
MDI is not Rasch-linear. The correlation is encouraging but MDI was designed as a consensus signal, not a measurement instrument. Making MDI formally Rasch-calibrated would require redesigning it around logistic item-response functions.

What This Means

MDI is not a Rasch measure. It wasn't designed to be one, and it doesn't need to be one for most use cases.

But MDI is empirically correlated with Rasch measurement (r = 0.78). It tracks the same underlying construct — the "verifiability boundary" of a claim. When MDI says "models disagree," Rasch confirms that the claim is genuinely difficult to measure. When MDI says "strong consensus," Rasch confirms there's nothing interesting to calibrate.

For ERC-8183 Evaluators, this means:

High MDI → auto-complete. Models agree, Rasch confirms it's unambiguous. Release the escrow.
Low MDI → human review. Models disagree, Rasch confirms it's a boundary case. Flag for arbitration.
Model strictness is measurable. Operators can choose verifier panels calibrated to their risk tolerance — lenient (Claude-style) or strict (GPT-style).

Reproduce It

All data and code:

100 prompts (50 clear + 50 edge cases)
300 raw verification results (per-verifier verdicts + reasons + confidence)
Rasch analysis scripts (Python, scipy JMLE)
Available on request: raul@thoughtproof.ai

This experiment was prompted by Dr. Matt Barney's comment on the ERC-8183 Ethereum Magicians thread. We appreciate the challenge — it made MDI better understood.