Research March 11, 2026 · 8 min read

MDI Meets Rasch: Psychometric Validation of Multi-Model Verification

Someone asked if our Model Diversity Index is a "real" measure. So we tested it against the gold standard of measurement science. 100 prompts. 3 models from 3 independent providers. Here's what we found.

"Most ML metrics fail the most basic tests of real measurement — they're not linear, they're not unbiased, they're not traceable to standards."

— Dr. Matt Barney, on the ERC-8183 Ethereum Magicians thread

The Challenge

Dr. Barney is right. Most AI evaluation scores are not measures in any rigorous sense. They're heuristics — useful, but not calibrated against measurement theory.

ThoughtProof's Model Diversity Index (MDI) quantifies cross-model consensus. When 3+ independent AI models from different providers agree on a verification, MDI is high. When they disagree, MDI drops. The question: does MDI actually measure what we claim it measures?
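The post doesn't spell out the MDI formula, so here is a minimal sketch under one plausible formulation: MDI as the fraction of model pairs that return the same verdict (the `mdi` helper name is ours, not ThoughtProof's):

```python
from itertools import combinations

def mdi(verdicts: list[str]) -> float:
    """Model Diversity Index as a pairwise agreement rate.

    Assumed formulation (the exact MDI formula isn't given here):
    the fraction of model pairs returning the same verdict.
    1.0 = unanimous consensus; lower values = more disagreement.
    """
    pairs = list(combinations(verdicts, 2))
    agree = sum(a == b for a, b in pairs)
    return agree / len(pairs)

print(mdi(["CONFIRM", "CONFIRM", "CONFIRM"]))  # 1.0 -> unanimous
print(mdi(["CONFIRM", "REJECT", "REJECT"]))    # ~0.33 -> 2-1 split
```

With three models there are only three pairs, so MDI can take just two values below 1.0; with 3+ models from more providers the scale gets finer.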

The Rasch Model is the gold standard in psychometrics for evaluating whether a measurement instrument produces real, linear, unbiased measures. It was developed by Georg Rasch in 1960 and is used to validate everything from standardized tests to clinical assessments.

So we ran the experiment.

The Experiment

Two benchmarks, 100 prompts total, verified by 3 frontier models independently:

Benchmark V1: Clear Cases
50 prompts: 20 correct facts, 15 hallucinations, 15 adversarial claims
Benchmark V2: Edge Cases
50 prompts: precision-dependent, outdated, contested science, framing-dependent, statistical claims

Verifiers: Claude Sonnet 4.6 (Anthropic), GPT-5.4 (OpenAI), DeepSeek Chat (DeepSeek). Three models from three independent providers. Each verified independently — no shared context, no chain-of-thought leaking. A fourth model (Grok) was planned but its API key expired during the run.

Note on "frontier": DeepSeek Chat is a capable model but not frontier-class. We chose it for provider diversity (Chinese lab vs. US labs), not because it matches Claude or GPT-5.4 in raw capability. This matters for the Rasch analysis — a weaker model might introduce noise rather than signal.
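The independence protocol (each model sees only the claim, never another model's output) amounts to a parallel fan-out. A sketch, where `verify_with` is a hypothetical stub standing in for the real provider API calls — the canned verdicts mirror the one EU AI Act disagreement from the clear-case benchmark:

```python
from concurrent.futures import ThreadPoolExecutor

def verify_with(model: str, claim: str) -> str:
    # Hypothetical stub: the real run calls each provider's API in a
    # fresh context, with no shared chain-of-thought between models.
    canned = {
        "Claude Sonnet 4.6": "CONFIRM",  # tolerant of the loose date
        "GPT-5.4": "REJECT",             # treats the phrasing as imprecise
        "DeepSeek Chat": "REJECT",
    }
    return canned[model]

models = ["Claude Sonnet 4.6", "GPT-5.4", "DeepSeek Chat"]
claim = "The EU AI Act entered into force in August 2024."

# Fan out in parallel; each call is isolated from the others.
with ThreadPoolExecutor(max_workers=len(models)) as pool:
    verdicts = dict(zip(models, pool.map(verify_with, models, [claim] * 3)))

print(verdicts)  # 2-of-3 REJECT: a disagreement case, so MDI drops
```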

Result 1: Clear Cases — Too Much Agreement

On the first benchmark, 49 of 50 prompts drew unanimous verdicts across all three models: correct facts confirmed, hallucinations and adversarial claims rejected.


The Rasch model couldn't fit — there wasn't enough variance. When models agree this strongly, there's nothing to calibrate against.

The one disagreement: "The EU AI Act entered into force in August 2024." Claude confirmed (technically correct — it was August 1). GPT-5.4 and DeepSeek rejected (imprecise formulation). The disagreement wasn't about facts but about precision thresholds.

This is actually a strong result: when the input is clear, MDI=1.0 and all models agree. MDI correctly signals "no review needed."
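In routing terms, that signal is a one-line gate. A sketch — the 1.0 threshold follows the "MDI = 1.0 means no review needed" observation above, and the function name is ours:

```python
def route(mdi_score: float, threshold: float = 1.0) -> str:
    """Gate human review on cross-model consensus.

    Assumption: a full-consensus threshold of 1.0, per the
    'MDI = 1.0 signals no review needed' observation; real
    deployments might choose a lower cut-off.
    """
    return "no-review" if mdi_score >= threshold else "human-review"

print(route(1.0))   # no-review
print(route(0.33))  # human-review
```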

Result 2: Edge Cases — Rasch Converges

The second benchmark was designed to produce disagreement: 50 "gray zone" prompts where reasonable models could disagree.

Results:

Unanimous CONFIRM: 10
Unanimous REJECT: 30
Disagreement: 10

Rasch converged: ✅ Yes

10 prompts with genuine disagreement — enough variance for a Rasch fit.

The Rasch Numbers

Model Strictness (β)

Rasch separates "item difficulty" from "person ability." In our framing: model strictness (β) measures how likely a model is to REJECT, independent of the prompt.

Model               β (strictness)   Confirm Rate   Infit MNSQ
Claude Sonnet 4.6   −1.66            36%            0.67 ✓
GPT-5.4             +0.98            26%            1.07 ✓
DeepSeek Chat       +0.98            26%            1.07 ✓
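Under the dichotomous Rasch model, these β values have a direct probabilistic reading. A sketch, assuming the parameterization P(REJECT) = σ(β_model − θ_prompt) — the exact sign convention used in the analysis isn't stated:

```python
import math

def p_reject(beta: float, theta: float) -> float:
    # Dichotomous Rasch model: the log-odds of a REJECT verdict equal
    # the model's strictness minus the prompt's latent parameter.
    return 1.0 / (1.0 + math.exp(-(beta - theta)))

# On a hypothetical neutral prompt (theta = 0), the fitted strictness
# values imply very different reject probabilities:
print(round(p_reject(-1.66, 0.0), 2))  # 0.16 -> Claude rarely rejects
print(round(p_reject(+0.98, 0.0), 2))  # 0.73 -> GPT-5.4 usually rejects
```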

Claude is systematically more tolerant. It accepts approximations ("light travels at 300,000 km/s") that GPT-5.4 and DeepSeek reject as imprecise. This isn't a bug — it's a measurable, stable personality difference between model families.

All three models show good Rasch fit (Infit MNSQ between 0.5–1.5). This means the Rasch model adequately describes their response patterns.
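Infit MNSQ itself is an information-weighted mean-square residual. A minimal sketch of the standard formula, comparing observed 0/1 verdicts against Rasch-expected probabilities (the numbers below are illustrative, not the experiment's data):

```python
def infit_mnsq(observed, expected):
    # Information-weighted mean square: summed squared residuals divided
    # by the summed binomial variance. Values near 1.0 mean the Rasch
    # model's predictions match the response pattern; 0.5-1.5 is the
    # usual acceptance band.
    num = sum((x - p) ** 2 for x, p in zip(observed, expected))
    den = sum(p * (1 - p) for p in expected)
    return num / den

obs = [1, 0, 1, 0]               # 1 = REJECT verdict
exp = [0.8, 0.3, 0.7, 0.2]       # Rasch-predicted reject probabilities
print(round(infit_mnsq(obs, exp), 2))  # 0.35
```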

MDI ↔ Rasch Correlation

The key question: does MDI track Rasch-measured item difficulty?

r = −0.78
MDI ↔ |Rasch θ| correlation (strong)

Strong negative correlation (|r| = 0.78). The sign is negative because MDI and |θ| move in opposite directions: when MDI drops (disagreement), Rasch θ magnitude increases (the prompt is at a measurement boundary). When MDI is high (consensus), the prompt sits comfortably within the measurement range. This is the expected direction.

In plain English: MDI and Rasch track the same underlying construct — the degree to which a claim sits at the boundary of verifiability.

What the Disagreements Reveal

The 10 disagreement cases fall into clear categories:

Precision tolerance (P01, P03, P07, P10): Claude accepts common approximations. GPT-5.4 demands exact figures. Neither is "wrong" — they have different precision thresholds.
Temporal knowledge (T05, T06, T08, T10): Models differ on when a "currently true" claim becomes outdated. DeepSeek still thinks China has the largest population (India overtook in 2023).
Correlation vs. causation (X01): Claude confirms a documented correlation. GPT-5.4 and DeepSeek reject it as misleading. This is a genuine philosophical disagreement about what "correct" means.

Every disagreement case is one where a human reviewer would add genuine value. These aren't bugs in the verification system — they're the system correctly identifying claims that need human judgment.

The Honest Gaps

We want to be direct about what this experiment does and doesn't show:

What This Means

MDI is not a Rasch measure. It wasn't designed to be one, and it doesn't need to be one for most use cases.

But MDI is empirically correlated with Rasch measurement (|r| = 0.78). It tracks the same underlying construct — the "verifiability boundary" of a claim. When MDI says "models disagree," Rasch confirms that the claim is genuinely difficult to measure. When MDI says "strong consensus," Rasch confirms there's nothing interesting to calibrate.

For ERC-8183 Evaluators, this means:

Reproduce It

All data and code:

This experiment was prompted by Dr. Matt Barney's comment on the ERC-8183 Ethereum Magicians thread. We appreciate the challenge — it deepened our understanding of MDI.