In March, we ran a small experiment: 50 edge-case claims through 3 AI models, scored against ground truth, and applied Rasch measurement — the same psychometric framework used to calibrate standardized tests. The question was simple: can it work on AI judges?
It worked. Rasch converged, models separated cleanly, and our Model Disagreement Index correlated with Rasch difficulty at |r| = 0.78. Interesting, but 50 claims and 3 models is a proof of concept, not evidence.
So we scaled it up. 150 real-world sentinel cases. 10 models. 1,500 evaluations. This time with production data, not contrived prompts.
Here's what we found — and what broke.
We used our Sentinel gold standard: 150 cases spanning 13 industries, 5 languages, and 4 verification modes (handoff, memory write, plan revision, output synthesis). Each case has a known-correct verdict.
Panel A — External Models (4):
Claude Sonnet 4.6, DeepSeek Chat, Grok 4.1 Fast, Gemini 2.5 Flash
Panel B — SERV Models (6):
serv-nano, serv-mini, serv-swift, serv-standard, serv-pro, serv-ultra
Both panels evaluated all 150 cases independently. We ran Rasch analysis on each panel separately.
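For readers new to Rasch, the core of the dichotomous model fits in a few lines. This is a minimal sketch that assumes each verdict is scored as simply correct or incorrect against ground truth, which the setup implies but does not spell out. The function name and the example values are illustrative only.

```python
import math

def rasch_probability(theta: float, delta: float) -> float:
    """Probability that a judge with ability theta gets an item of difficulty delta right.

    Both parameters are in logits. This is the standard dichotomous Rasch model:
    P(correct) = exp(theta - delta) / (1 + exp(theta - delta)).
    """
    return 1.0 / (1.0 + math.exp(-(theta - delta)))

# When ability equals difficulty, the judge has even odds.
print(rasch_probability(0.0, 0.0))            # 0.5

# Illustrative values only: a judge one logit above an item's difficulty.
print(round(rasch_probability(1.0, 0.0), 3))  # 0.731
```

Everything Rasch reports (ability θ, item difficulty δ, fit statistics) comes from comparing observed verdicts against these expected probabilities.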
| Model | Accuracy | θ (ability) | Infit | Status |
|---|---|---|---|---|
| Claude Sonnet 4.6 | 75.3% | +0.729 | 0.549 | ✅ |
| Grok 4.1 Fast | 72.6% | +0.397 | 0.497 | ⚠️ |
| DeepSeek Chat | 68.7% | +0.066 | 0.641 | ✅ |
| Gemini 2.5 Flash | 54.0% | −1.193 | 1.907 | ⚠️ |
Person Separation: 3.19 · Reliability: 0.91
That separation index is excellent. For context, educational testing considers 2.0+ "good" and 3.0+ "excellent." At 3.19, Rasch can reliably distinguish these models as fundamentally different evaluators — not noise, not luck, but measurable differences in judgment quality.
| Model | Accuracy | θ (ability) | Infit | False Allows |
|---|---|---|---|---|
| serv-ultra | 82.7% | +0.553 | 0.760 | 0 |
| serv-pro | 82.0% | +0.472 | 0.736 | 0 |
| serv-swift | 81.3% | +0.394 | 0.696 | 0 |
| serv-standard | 80.0% | +0.245 | 0.903 | 0 |
| serv-mini | 73.3% | −0.393 | 0.906 | 0 |
| serv-nano | 62.7% | −1.271 | 1.862 | 5 |
Person Separation: 2.59 · Reliability: 0.87
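Also strong, though lower than Panel A. The overall accuracy range is similar (62.7% to 82.7% vs. 54.0% to 75.3%), but four of the six SERV models sit within 2.7 points of each other (80.0% to 82.7%), and that bunching compresses the separation index.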
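The separation and reliability figures reported for each panel are two views of the same quantity. In standard Rasch reporting, reliability follows from the separation index G as R = G² / (1 + G²). A quick check (the function name is illustrative) reproduces both reported reliabilities from the reported separations:

```python
def reliability_from_separation(separation: float) -> float:
    """Convert a separation index G into reliability R = G^2 / (1 + G^2)."""
    g_squared = separation ** 2
    return g_squared / (1.0 + g_squared)

for panel, separation in [("Panel A", 3.19), ("Panel B", 2.59)]:
    print(panel, round(reliability_from_separation(separation), 2))
# Panel A 0.91
# Panel B 0.87
```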
Gemini 2.5 Flash returned UNCERTAIN on 50% of all cases — 75 out of 150. Not wrong, not right, just... indecisive. In handoff mode: 40% UNCERTAIN. In memory write: 60% UNCERTAIN.
Its Rasch parameters tell the story: θ = −1.193 with an outfit of 4.372 (anything above 2.0 is severe misfit). Gemini isn't just a weaker evaluator — it's measuring something different from the other three models. Rasch is flagging it as not belonging to this measurement construct.
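For reference, infit and outfit are both mean-square statistics built from the gap between observed and expected scores. The sketch below uses the textbook definitions for dichotomous data. It assumes verdicts are scored 0/1 and that θ and the item difficulties have already been estimated; how UNCERTAIN verdicts were scored in our runs is not covered here, and the toy inputs are made up.

```python
import math

def fit_statistics(responses, theta, deltas):
    """Return (infit, outfit) mean-squares for one judge.

    responses: list of 0/1 scores (1 = verdict matched ground truth)
    theta:     the judge's estimated ability in logits
    deltas:    estimated item difficulties in logits, same order as responses
    """
    squared_residuals = []
    variances = []
    for x, delta in zip(responses, deltas):
        p = 1.0 / (1.0 + math.exp(-(theta - delta)))  # expected score under Rasch
        variances.append(p * (1.0 - p))               # model variance of the score
        squared_residuals.append((x - p) ** 2)
    # Outfit: unweighted mean of squared standardized residuals, so a few
    # surprising answers on very easy or very hard items can dominate it.
    outfit = sum(r / v for r, v in zip(squared_residuals, variances)) / len(responses)
    # Infit: information-weighted, so items near the judge's ability dominate.
    infit = sum(squared_residuals) / sum(variances)
    return infit, outfit

toy_deltas = [-2.0, -1.0, 0.0, 1.0, 2.0]  # easiest to hardest, made up
# Right on easy items, wrong on hard ones: more orderly than the model expects.
print(fit_statistics([1, 1, 1, 0, 0], 0.0, toy_deltas))
# Wrong on easy items, right on hard ones: both statistics rise well above 2.
print(fit_statistics([0, 0, 0, 1, 1], 0.0, toy_deltas))
```

Values near 1.0 mean a judge behaves as the model predicts. A large outfit like Gemini's says its misses land where the model considers them most surprising.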
This matters because Gemini is the cheapest option in many AI stacks. If your verification pipeline includes Gemini as one of several judges, Rasch is telling you to weight its verdicts very differently — or remove it from the panel entirely.
serv-nano sits at the bottom of the SERV ranking (62.7%, θ = −1.271) with 5 false allows, making it the only SERV model that lets dangerous outputs through. Its infit of 1.862 marks its responses as erratic relative to what the model expects, and the false allows show the direction of the problem: it is too permissive, saying ALLOW where other models say BLOCK.
In a cascade architecture (which is how we use it in production), nano is the first-pass filter. Its job isn't to be right — it's to be safely wrong. A false BLOCK costs a cascade step. A false ALLOW costs trust. Five false allows in 150 cases means nano needs tighter prompting or a lower confidence threshold.
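To make the "safely wrong" idea concrete, here is a hypothetical sketch of a cascade in which only a confident ALLOW can stop early. It is not our production code: the `Verdict` type, the `cascade` function, and the `allow_threshold` parameter are all assumptions made for illustration, including the meaning of the threshold.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Verdict:
    label: str         # "ALLOW", "BLOCK", or "UNCERTAIN"
    confidence: float  # 0.0 to 1.0, as reported by the judge (hypothetical field)

Judge = Callable[[str], Verdict]

def cascade(case: str, judges: List[Judge], allow_threshold: float = 0.9) -> Verdict:
    """Run judges cheapest-first; only a confident ALLOW stops early.

    A BLOCK, an UNCERTAIN, or a low-confidence ALLOW escalates to the next
    judge, so an early-tier mistake costs a cascade step rather than trust.
    The last judge's verdict is returned as-is.
    """
    if not judges:
        raise ValueError("cascade needs at least one judge")
    for judge in judges[:-1]:
        verdict = judge(case)
        if verdict.label == "ALLOW" and verdict.confidence >= allow_threshold:
            return verdict
    return judges[-1](case)

# Stub judges standing in for a cheap first pass and a stronger second tier.
first_pass = lambda case: Verdict("ALLOW", 0.6)
second_tier = lambda case: Verdict("BLOCK", 0.95)
print(cascade("example case", [first_pass, second_tier]).label)  # BLOCK
```

In this sketch, raising `allow_threshold` makes the first pass stricter: more cases escalate, which trades extra cascade steps for fewer false allows.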
In our March experiment: |r| = 0.78 between MDI and Rasch difficulty.
In this experiment: |r| = 0.47.
We expected this. The March experiment used 50 deliberately constructed edge cases — designed to trigger disagreement. This experiment uses 150 production cases where many are straightforward. When most cases are easy (all models agree), the correlation naturally weakens because there's less variance to correlate.
The Pearson r of 0.47 with p = 0.0002 is still statistically significant. MDI tracks Rasch difficulty — it just does so more clearly when cases are hard.
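The comparison itself is simple to reproduce on any panel. The sketch below uses a stand-in disagreement measure (the share of judges that did not give the most common verdict), which is an assumption made for illustration and not necessarily how MDI is defined. The verdicts and difficulties are toy values, not our data.

```python
from collections import Counter
import math

def disagreement_index(verdicts):
    """Share of judges that did not give the most common verdict (0 = unanimous)."""
    most_common_count = Counter(verdicts).most_common(1)[0][1]
    return 1.0 - most_common_count / len(verdicts)

def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Toy verdicts for four cases from four judges (made up for illustration).
panel_verdicts = [
    ["ALLOW", "ALLOW", "ALLOW", "ALLOW"],
    ["BLOCK", "BLOCK", "BLOCK", "ALLOW"],
    ["BLOCK", "ALLOW", "BLOCK", "ALLOW"],
    ["ALLOW", "BLOCK", "UNCERTAIN", "BLOCK"],
]
toy_difficulties = [-1.2, 0.1, 1.4, 0.9]  # hypothetical Rasch deltas in logits

disagreement = [disagreement_index(v) for v in panel_verdicts]
print(disagreement)  # [0.0, 0.25, 0.5, 0.5]
print(pearson_r(disagreement, toy_difficulties))
```

When most cases are unanimous, the disagreement values pile up at zero. That leaves little variance for the correlation to work with, which is the weakening described above.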
Here's the result we didn't expect:
| Evaluator | Accuracy | False Allows |
|---|---|---|
| serv-swift | 81.3% | 0 |
| serv-standard | 80.0% | 0 |
| serv-pro | 82.0% | 0 |
| serv-ultra | 82.7% | 0 |
| Claude Sonnet 4.6 | 75.3% | 2 |
| Grok 4.1 Fast | 72.6% | 2 |
| DeepSeek Chat | 68.7% | 4 |
| Gemini 2.5 Flash | 54.0% | 52 |
Every SERV model from swift upward outperforms every external model, and not by a little: serv-swift at 81.3% beats Claude (the best external model) by 6 percentage points, with zero false allows vs. Claude's two. Even serv-standard, the lowest of the four at 80.0%, clears Claude by 4.7 points.
This isn't cherry-picked. Same 150 cases, same ground truth, same evaluation criteria. The SERV models are fine-tuned for verification; the external models are general-purpose. The gap is the difference between a tool built for the job and a tool adapted for it.
Accuracy is a single number. Rasch gives you a measurement model.
It tells you which models are interchangeable. serv-swift (θ = +0.394), serv-standard (+0.245), and serv-pro (+0.472) cluster tightly — they're statistically similar evaluators. You could swap one for another without meaningfully changing outcomes.
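One common way to test whether two ability estimates differ beyond noise is to divide the gap by the combined standard error. The sketch below is illustrative only: the standard errors are not reported in this post, so `PLACEHOLDER_SE` is a hypothetical value and the result shows the mechanics rather than a finding.

```python
import math

def ability_gap_z(theta_a, se_a, theta_b, se_b):
    """z-score for the difference between two ability estimates (in logits)."""
    return (theta_a - theta_b) / math.sqrt(se_a ** 2 + se_b ** 2)

PLACEHOLDER_SE = 0.20  # hypothetical; substitute the SEs from your own Rasch output

# serv-pro (+0.472) vs. serv-standard (+0.245), the widest gap in the cluster.
z = ability_gap_z(0.472, PLACEHOLDER_SE, 0.245, PLACEHOLDER_SE)
print(abs(z) < 1.96)  # True under the placeholder SE
```

With real standard errors in place of the placeholder, an |z| below roughly 1.96 suggests the two models cannot be told apart at the usual 5% level.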
It tells you which models are measuring differently. Gemini's outfit of 4.372 and nano's infit of 1.862 flag them as misfits — their error patterns are structurally different from the rest of the panel. Including them without adjustment introduces systematic bias.
It tells you which cases are hard. The hardest items (δ > 1.5) cluster in handoff and plan_revision modes. output_synthesis is easiest. If you're prioritizing where to add human review, Rasch points you to the right queue.
It gives you separation, not just ranking. A leaderboard tells you Model A > Model B. Rasch tells you how much greater — and whether the difference is reliable or noise. At Person Separation 3.19, we're not guessing.
We're extending this to temporal calibration — tracking whether model abilities (θ) drift across API updates. When Gemini ships a new version, does its Rasch profile change? If it does, your cascade weights need to change with it.
We're also exploring item banking: building a curated set of calibration cases that can benchmark any new model in minutes, not hours.
If you're building AI evaluation systems and want to move beyond accuracy leaderboards, reach out. Rasch measurement is a 60-year-old technology. It's time AI caught up.
Raul Jäger is the founder of ThoughtProof, which builds verification infrastructure for AI agent decisions. The Rasch calibration work described here is part of ThoughtProof's ongoing research into AI evaluator measurement.