We Calibrated 10 AI Models with Rasch. Here's What Broke.

Research Rasch Calibration
May 2026 · ThoughtProof Research

In March, we ran a small experiment: 50 edge-case claims through 3 AI models, scored against ground truth, and applied Rasch measurement — the same psychometric framework used to calibrate standardized tests. The question was simple: can it work on AI judges?

It worked. Rasch converged, models separated cleanly, and our Model Disagreement Index correlated with Rasch difficulty at |r| = 0.78. Interesting, but 50 claims and 3 models is a proof of concept, not evidence.

So we scaled it up. 150 real-world sentinel cases. 10 models. 1,500 evaluations. This time with production data, not contrived prompts.

Here's what we found — and what broke.

The Setup

We used our Sentinel gold standard: 150 cases spanning 13 industries, 5 languages, and 4 verification modes (handoff, memory write, plan revision, output synthesis). Each case has a known-correct verdict.

Panel A — External Models (4):
Claude Sonnet 4.6, DeepSeek Chat, Grok 4.1 Fast, Gemini 2.5 Flash

Panel B — SERV Models (6):
serv-nano, serv-mini, serv-swift, serv-standard, serv-pro, serv-ultra

Both panels evaluated all 150 cases independently. We ran Rasch analysis on each panel separately.
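For readers new to Rasch, the core of the dichotomous model is a single logistic function: the probability of a correct verdict depends only on the gap between a judge's ability θ and a case's difficulty δ, both in logits. A minimal sketch follows. The function name and the example values are illustrative assumptions, not taken from the analysis pipeline.

```python
import math


def rasch_probability(theta: float, delta: float) -> float:
    """Dichotomous Rasch model: P(correct) for ability theta and difficulty delta (logits)."""
    return 1.0 / (1.0 + math.exp(-(theta - delta)))


# Illustrative values only: a stronger judge, a weaker judge, three case difficulties.
strong_judge = 0.7
weak_judge = -1.2

for delta in (-1.0, 0.0, 1.5):
    p_strong = rasch_probability(strong_judge, delta)
    p_weak = rasch_probability(weak_judge, delta)
    print(f"delta={delta:+.1f}  strong={p_strong:.2f}  weak={p_weak:.2f}")
```

When ability equals difficulty the probability is exactly 0.5, which is what puts models and cases on the same scale.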

The Numbers

External Panel

Model               Accuracy   θ (ability)   Infit   Status
Claude Sonnet 4.6   75.3%      +0.729        0.549
Grok 4.1 Fast       72.6%      +0.397        0.497   ⚠️
DeepSeek Chat       68.7%      +0.066        0.641
Gemini 2.5 Flash    54.0%      −1.193        1.907   ⚠️

Person Separation: 3.19 · Reliability: 0.91

That separation index is excellent. For context, educational testing considers 2.0+ "good" and 3.0+ "excellent." At 3.19, Rasch can reliably distinguish these models as fundamentally different evaluators — not noise, not luck, but measurable differences in judgment quality.

SERV Panel

Model           Accuracy   θ (ability)   Infit   False Allows
serv-ultra      82.7%      +0.553        0.760   0
serv-pro        82.0%      +0.472        0.736   0
serv-swift      81.3%      +0.394        0.696   0
serv-standard   80.0%      +0.245        0.903   0
serv-mini       73.3%      −0.393        0.906   0
serv-nano       62.7%      −1.271        1.862   5

Person Separation: 2.59 · Reliability: 0.87

Also strong, though the slightly tighter accuracy spread (20.0 points, 62.7% to 82.7%, vs. 21.3 points, 54.0% to 75.3%) likely contributes to the lower separation index.
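Separation and reliability are two views of the same quantity. In the standard Rasch convention, reliability R and separation G are related by R = G² / (1 + G²). The sketch below checks that the pairs reported for both panels are consistent with that relation; the helper names are our own.

```python
import math


def reliability_from_separation(g: float) -> float:
    """Reliability implied by a separation index G: R = G^2 / (1 + G^2)."""
    return g * g / (1.0 + g * g)


def separation_from_reliability(r: float) -> float:
    """Inverse relation: G = sqrt(R / (1 - R)), valid for 0 <= R < 1."""
    return math.sqrt(r / (1.0 - r))


for panel, g in (("External", 3.19), ("SERV", 2.59)):
    r = reliability_from_separation(g)
    print(f"{panel}: separation {g:.2f} -> reliability {r:.2f}")
# External: separation 3.19 -> reliability 0.91
# SERV: separation 2.59 -> reliability 0.87
```

Because the relation is nonlinear, the gap between 0.87 and 0.91 in reliability looks small but corresponds to a visibly larger gap in separation.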

What Broke

1. Gemini collapsed under uncertainty

Gemini 2.5 Flash returned UNCERTAIN on 50% of all cases — 75 out of 150. Not wrong, not right, just... indecisive. In handoff mode: 40% UNCERTAIN. In memory write: 60% UNCERTAIN.

Its Rasch parameters tell the story: θ = −1.193 with an outfit of 4.372 (anything above 2.0 is severe misfit). Gemini isn't just a weaker evaluator — it's measuring something different from the other three models. Rasch is flagging it as not belonging to this measurement construct.

This matters because Gemini is the cheapest option in many AI stacks. If your verification pipeline includes Gemini as one of several judges, Rasch is telling you to weight its verdicts very differently — or remove it from the panel entirely.
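To see why outfit can run far above infit for the same model, it helps to look at how the two mean-squares are built. Outfit is the plain average of squared standardized residuals, so a few very surprising verdicts dominate it. Infit weights each residual by the model variance, which damps those outliers. The sketch below uses the textbook formulas on hypothetical difficulties and response patterns; none of these numbers come from the Sentinel data.

```python
import math


def rasch_probability(theta: float, delta: float) -> float:
    """Dichotomous Rasch model: P(correct) for ability theta and difficulty delta."""
    return 1.0 / (1.0 + math.exp(-(theta - delta)))


def fit_mean_squares(responses, theta, deltas):
    """Return (infit, outfit) mean-squares for one judge.

    responses: 0/1 scores per case; deltas: case difficulties in logits.
    """
    squared_residuals = []
    variances = []
    for x, delta in zip(responses, deltas):
        p = rasch_probability(theta, delta)
        squared_residuals.append((x - p) ** 2)
        variances.append(p * (1.0 - p))
    # Outfit: unweighted mean of squared standardized residuals.
    outfit = sum(r / v for r, v in zip(squared_residuals, variances)) / len(responses)
    # Infit: information-weighted, i.e. total squared residual over total variance.
    infit = sum(squared_residuals) / sum(variances)
    return infit, outfit


# Hypothetical cases from very easy (-3.0) to very hard (+3.0), judge at theta = 0.
deltas = [-3.0, -1.5, 0.0, 1.5, 3.0]
consistent = [1, 1, 1, 0, 0]  # right on easier cases, wrong on harder ones
erratic = [0, 1, 1, 0, 1]     # misses the easiest case, gets the hardest right

for label, pattern in (("consistent", consistent), ("erratic", erratic)):
    infit, outfit = fit_mean_squares(pattern, 0.0, deltas)
    print(f"{label}: infit={infit:.2f} outfit={outfit:.2f}")
```

The erratic pattern pushes both statistics above 2.0, with outfit well ahead of infit, which is the same shape as Gemini's 1.907 infit against its 4.372 outfit.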

2. Nano misfits in the other direction

serv-nano sits at the bottom of the SERV ranking (62.7%, θ = −1.271) with 5 false allows, making it the only SERV model that lets dangerous outputs through. Its infit of 1.862 flags it as a misfit, and the false allows show the direction: it is systematically too permissive, saying ALLOW where other models say BLOCK.

In a cascade architecture (which is how we use it in production), nano is the first-pass filter. Its job isn't to be right — it's to be safely wrong. A false BLOCK costs a cascade step. A false ALLOW costs trust. Five false allows in 150 cases means nano needs tighter prompting or a lower confidence threshold.

3. MDI-Rasch correlation dropped — and that's fine

In our March experiment: |r| = 0.78 between MDI and Rasch difficulty.
In this experiment: |r| = 0.47.

We expected this. The March experiment used 50 deliberately constructed edge cases — designed to trigger disagreement. This experiment uses 150 production cases where many are straightforward. When most cases are easy (all models agree), the correlation naturally weakens because there's less variance to correlate.

The Pearson r of 0.47 with p = 0.0002 is still statistically significant. MDI tracks Rasch difficulty — it just does so more clearly when cases are hard.

The Headline: SERV Beats the Panel

Here's the result we didn't expect:

Evaluator           Accuracy   False Allows
serv-swift          81.3%      0
serv-standard       80.0%      0
serv-pro            82.0%      0
serv-ultra          82.7%      0
Claude Sonnet 4.6   75.3%      2
Grok 4.1 Fast       72.6%      2
DeepSeek Chat       68.7%      4
Gemini 2.5 Flash    54.0%      52

Every SERV model from swift upward outperforms every external model. Not by a little — serv-swift at 81.3% beats Claude (the best external model) by 6 percentage points, with zero false allows vs. Claude's two.

This isn't cherry-picked. Same 150 cases, same ground truth, same evaluation criteria. The SERV models are fine-tuned for verification; the external models are general-purpose. The gap is the difference between a tool built for the job and a tool adapted for it.

What Rasch Tells You That Accuracy Doesn't

Accuracy is a single number. Rasch gives you a measurement model.

It tells you which models are interchangeable. serv-swift (θ = +0.394), serv-standard (+0.245), and serv-pro (+0.472) cluster tightly — they're statistically similar evaluators. You could swap one for another without meaningfully changing outcomes.

It tells you which models are measuring differently. Gemini's outfit of 4.372 and nano's infit of 1.862 flag them as misfits — their error patterns are structurally different from the rest of the panel. Including them without adjustment introduces systematic bias.

It tells you which cases are hard. The hardest items (δ > 1.5) cluster in handoff and plan_revision modes. output_synthesis is easiest. If you're prioritizing where to add human review, Rasch points you to the right queue.

It gives you separation, not just ranking. A leaderboard tells you Model A > Model B. Rasch tells you how much greater — and whether the difference is reliable or noise. At Person Separation 3.19, we're not guessing.

Practical Implications

  1. Don't treat all AI judges equally. Rasch-calibrated weights outperform simple majority voting.
  2. Gemini needs special handling. 50% UNCERTAIN isn't caution — it's measurement failure. Either retrain the prompt, add a fallback, or remove it from verification panels.
  3. Nano works as a filter, not a judge. Its false allow rate (5/150) is acceptable in a cascade where every ALLOW gets escalated. It's unacceptable as a standalone verdict.
  4. Verification-specific models matter. The SERV vs. external gap isn't marginal — it's the difference between 75% and 82% accuracy at the same cost tier.
  5. Scale your calibration. 50 cases was a proof of concept. 150 cases with 10 models gives Person Reliability of 0.91 on the external panel and 0.87 on the SERV panel: publishable-grade measurement. If you're running AI judges in production, calibrate them.
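To make the first point concrete, here is one way calibration can feed into vote weights. This is a sketch under stated assumptions, not the weighting scheme used in production: it takes the log-odds of each model's reported accuracy as a stand-in weight, treats UNCERTAIN as an abstention, and uses hypothetical function names and a made-up set of votes.

```python
import math
from collections import Counter, defaultdict

# Accuracies reported for the external panel above.
ACCURACY = {
    "Claude Sonnet 4.6": 0.753,
    "Grok 4.1 Fast": 0.726,
    "DeepSeek Chat": 0.687,
    "Gemini 2.5 Flash": 0.540,
}


def log_odds_weight(accuracy: float) -> float:
    """Assumed weighting: log-odds of accuracy, near zero for a coin-flip judge."""
    return math.log(accuracy / (1.0 - accuracy))


def majority_verdict(votes: dict) -> str:
    """Unweighted majority over non-abstaining judges."""
    counts = Counter(v for v in votes.values() if v != "UNCERTAIN")
    return counts.most_common(1)[0][0] if counts else "UNCERTAIN"


def weighted_verdict(votes: dict) -> str:
    """Weighted vote: each non-abstaining judge adds its weight to its verdict."""
    totals = defaultdict(float)
    for model, verdict in votes.items():
        if verdict != "UNCERTAIN":
            totals[verdict] += log_odds_weight(ACCURACY[model])
    return max(totals, key=totals.get) if totals else "UNCERTAIN"


# Hypothetical case: the strongest judge disagrees with two weaker ones.
votes = {
    "Claude Sonnet 4.6": "BLOCK",
    "Grok 4.1 Fast": "UNCERTAIN",
    "DeepSeek Chat": "ALLOW",
    "Gemini 2.5 Flash": "ALLOW",
}
print(majority_verdict(votes))  # ALLOW
print(weighted_verdict(votes))  # BLOCK
```

In this made-up case the two ALLOW votes win a head count, but their combined weight is lower than Claude's single BLOCK, so the weighted vote lands on the more cautious verdict.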

What's Next

We're extending this to temporal calibration — tracking whether model abilities (θ) drift across API updates. When Gemini ships a new version, does its Rasch profile change? If it does, your cascade weights need to change with it.

We're also exploring item banking: building a curated set of calibration cases that can benchmark any new model in minutes, not hours.

If you're building AI evaluation systems and want to move beyond accuracy leaderboards, reach out. Rasch measurement is a 60-year-old technology. It's time AI caught up.

Raul Jäger is the founder of ThoughtProof, which builds verification infrastructure for AI agent decisions. The Rasch calibration work described here is part of ThoughtProof's ongoing research into AI evaluator measurement.