A team at UVA and NYU just published what might be the most important LLM evaluation of 2026. They tested 35 language models — from GPT-4.1 to Claude to DeepSeek to Llama-4 — on a deceptively simple task: when a value changes, can you remember the new one?
Every single model fails.
The paper is called "Unable to Forget: Proactive Interference Reveals Working Memory Limits in LLMs Beyond Context Length", and it borrows a concept from cognitive psychology that every medical professional knows: proactive interference — when old memories disrupt recall of newer information.
The setup is elegant. Stream a sequence of key-value updates to a model — like a nurse updating patient vitals throughout a shift. Key A gets value 1, then value 2, then value 3. At the end, ask: what's the current value of Key A?
The correct answer is always the last value presented. No search required. No reasoning. Just: what did you see most recently?
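The paper's exact prompt format isn't reproduced here, but the task is easy to sketch. Below is a toy generator under assumed formatting — the key names, the `key = value` line format, and the closing question are all illustrative, not the paper's:

```python
import random

def make_pi_trial(n_keys=4, n_updates=3, seed=0):
    """Build a toy proactive-interference trial: a shuffled stream of
    key-value updates, then a query for one key's final value."""
    rng = random.Random(seed)
    keys = [f"key_{i}" for i in range(n_keys)]
    stream, current = [], {}
    # Interleave updates: each key receives n_updates fresh values.
    events = [(k, u) for k in keys for u in range(n_updates)]
    rng.shuffle(events)
    for k, _ in events:
        v = rng.randint(0, 999)
        stream.append(f"{k} = {v}")
        current[k] = v  # ground truth is always the most recent write
    query = rng.choice(keys)
    return stream, query, current[query]

stream, query, answer = make_pi_trial(seed=7)
prompt = "\n".join(stream) + f"\nWhat is the current value of {query}?"
```

Grading is then a string comparison against `answer`: no retrieval, no reasoning, just recency.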
With 3 updates per key, most models score perfectly. But as updates accumulate (12, 48, 97, 400), accuracy collapses along a log-linear curve: the drop is linear in the logarithm of the update count, so each doubling of updates costs roughly the same slice of accuracy. The interference compounds.
At 400 updates per key (46 keys tracked), here's how the best models perform:
| Model | Accuracy | IES Score |
|---|---|---|
| Claude 3.5 Sonnet | ~60% | 0.85 |
| Grok-3 Beta (MoE) | ~58% | 0.78 |
| GPT-4.1 Mini | ~55% | 0.75 |
| GPT-4o | ~52% | 0.70 |
| GPT-4.1 | ~50% | 0.68 |
| DeepSeek V3 | ~20% | 0.38 |
| Llama-4 Maverick | ~18% | 0.35 |
| DeepSeek R1 (CoT) | ~8% | 0.22 |
| Llama-4 Scout | ~0% | 0.03 |
Read that last row again. Llama-4 Scout — Meta's newest model — returns the correct current value 0% of the time when there's meaningful update history.
The researchers ran a regression. Context length vs. interference resistance: p = 0.886. Completely non-significant. A model with 1M tokens of context is no better at distinguishing old from new than one with 128K. The information is in the context. The model just can't tell which version is current.
In 3 out of 4 comparisons, chain-of-thought (CoT) models performed as well as or worse than their base versions. DeepSeek R1 (with reasoning) scores 8% where DeepSeek V3 (without reasoning) scores 20%. Reasoning about confused memories doesn't unconfuse them.
Explicitly telling models to "ignore earlier values" and "only report the most recent update" yields, in the paper's words, "limited success." The interference is below the level that instructions can reach.
MoE architectures, where only a subset of parameters activates per token, consistently underperform dense models of comparable size. Qwen 2.5 72B (dense) beats Qwen3 235B (MoE). Llama-3.1 405B (dense) beats Llama-4 Maverick 400B (MoE). More parameters don't buy more working memory if most of them aren't active on any given token.
Consider a real scenario: a patient's blood pressure readings over a hospital stay.
08:00 — BP 142/88 (elevated)
12:00 — BP 128/82 (responding to medication)
16:00 — BP 118/76 (normalized)
20:00 — BP 156/94 (acute spike)
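The rule the model must apply here, last write wins, is trivial in ordinary code. A minimal sketch of the deterministic ground truth for the stream above (the tuple layout is illustrative):

```python
from datetime import time

# The vitals stream from the scenario above: (timestamp, systolic, diastolic).
readings = [
    (time(8, 0), 142, 88),
    (time(12, 0), 128, 82),
    (time(16, 0), 118, 76),
    (time(20, 0), 156, 94),
]

# Deterministic "last write wins": the current value is simply
# the entry with the latest timestamp.
latest = max(readings, key=lambda r: r[0])
print(f"Current BP: {latest[1]}/{latest[2]}")  # prints "Current BP: 156/94"
```

A dictionary overwritten in timestamp order would do the same thing in one line. The point of the paper is that this one-line operation is exactly what LLMs cannot do reliably under interference.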
An AI summarizing this patient's record needs to report the 20:00 reading: the current, clinically urgent value. But proactive interference biases the model systematically toward earlier readings. It might report an earlier, reassuring value like 128/82 instead of the critical spike.
This isn't a hallucination. The model isn't making up a blood pressure. It's returning a real value it actually saw — just the wrong one. The old one. And that's arguably more dangerous than a hallucination, because it looks correct.
Here's what the paper doesn't discuss, but what follows directly from their findings:
If proactive interference is model-specific — varying by architecture, training, and parameter count — then different models will confuse different values in different ways. Claude's interference pattern differs from GPT's differs from DeepSeek's.
This is exactly the scenario multi-model verification was designed for.
How ThoughtProof catches interference errors:
Three models independently process the same patient record. Claude returns BP 128/82. GPT-4.1 returns BP 156/94. DeepSeek returns BP 156/94.
The MDI (Model Diversity Index) detects disagreement. The system flags the output for review instead of silently passing an outdated value downstream.
No single model can detect its own interference. But cross-model disagreement reveals it.
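The MDI's actual scoring isn't described in this post, so the following is a simplified stand-in: a quorum vote that passes an output only when a majority of models agree, and flags everything else for review. The model names and values mirror the hypothetical scenario above.

```python
from collections import Counter

def flag_disagreement(outputs, quorum=2):
    """Toy cross-model check: accept a value only if at least `quorum`
    models return it; otherwise flag the whole set for human review."""
    counts = Counter(outputs.values())
    value, votes = counts.most_common(1)[0]
    if votes >= quorum:
        return {"status": "consensus", "value": value}
    return {"status": "flagged", "candidates": dict(counts)}

result = flag_disagreement({
    "claude": "128/82",    # stale value returned under interference
    "gpt-4.1": "156/94",
    "deepseek": "156/94",
})
# Two of three models agree on 156/94, so the stale 128/82 is outvoted;
# a three-way split would instead come back as "flagged".
```

Because each model's interference pattern is different, the stale value rarely wins the vote — and when no majority exists, the disagreement itself is the alarm.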
The paper's own data supports this. The IES (Interference Endurance Score) rankings show that models fail at different rates and different thresholds. Claude maintains 60% accuracy where DeepSeek drops to 20%. That gap is the signal. If they agreed, you'd trust the answer. When they disagree, you know to check.
The IES used above is the paper's Interference Endurance Score: the area under the accuracy curve as interference increases. It's effectively a "working memory capacity" metric for LLMs.
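The paper's exact normalization isn't quoted here, but one plausible reading — accuracy integrated over log update count, scaled so that perfect recall throughout gives 1.0 — can be sketched as:

```python
import math

def ies(update_counts, accuracies):
    """Sketch of an Interference Endurance Score: trapezoidal area under
    the accuracy curve plotted against log(update count), normalized so a
    model with perfect recall at every load scores 1.0. The paper's exact
    definition may differ; this is an assumed reconstruction."""
    xs = [math.log(n) for n in update_counts]
    area = sum(
        (accuracies[i] + accuracies[i + 1]) / 2 * (xs[i + 1] - xs[i])
        for i in range(len(xs) - 1)
    )
    return area / (xs[-1] - xs[0])

# Toy curves: a model that degrades slowly vs. one that collapses early.
slow = ies([3, 12, 48, 400], [1.0, 0.9, 0.8, 0.6])
fast = ies([3, 12, 48, 400], [1.0, 0.5, 0.2, 0.0])
```

Under this construction, `slow` scores well above `fast` even though both start at 100%: the metric rewards holding accuracy as interference piles up, not peak accuracy.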
Key finding: only parameter count predicts IES (p = 0.005). Not context length (p = 0.886). Not architecture type. Not reasoning capability. Raw parameter count is the only significant predictor of interference resistance.
This means you can't fix the problem by giving the model more context. You can't fix it by asking it to think harder. You can only partially mitigate it by using bigger models — and even the biggest (Claude 3.5 Sonnet, ~200B+ parameters) still drops to 60% accuracy under heavy interference.
This isn't just about chatbots. Any autonomous agent that tracks changing state across a long session faces the same failure mode: its context accumulates historical values, and those stale values create interference. And the paper's results show that no amount of context window or reasoning eliminates this.
ThoughtProof's verification API sends every output through 3+ independent models from different providers. When proactive interference causes one model to return a stale value, the disagreement surfaces immediately.
We don't claim to detect every interference error. But we make interference visible instead of silent. And in healthcare, finance, and legal — visible errors are survivable. Silent ones aren't.
The core insight: Proactive interference is model-specific. Multi-model verification exploits that specificity to catch errors no single model can detect on its own.
Paper: Arani et al. (2026), "Unable to Forget: Proactive Interference Reveals Working Memory Limits in LLMs Beyond Context Length", UVA / NYU.
ThoughtProof is an open-source epistemic verification protocol. pot-sdk on GitHub · Verification API · Agentic Commerce