Yesterday, Anthropic launched Code Review for Claude Code — a multi-agent system that dispatches a team of AI agents to review every pull request. It's impressive engineering. Reviews take ~20 minutes, cost $15–25, and Anthropic reports less than 1% of findings are marked incorrect by engineers.
We agree with the direction. We disagree with the implementation.
Multi-agent review is exactly what we've been building at ThoughtProof. The idea is sound, and Anthropic's data backs it up: before Code Review, 16% of PRs received substantive review comments. Now 54% do. That's a real improvement.
Here's what Anthropic doesn't address: all five agents are Claude.
Same model. Same training data. Same architectural biases. Same blind spots.
This is what reliability engineers call common-mode failure — when redundant systems share a failure cause that defeats all of them simultaneously.
In aviation, you don't put five identical sensors on a plane. You use sensors from different manufacturers, based on different physical principles. Because if one has a systematic error, they all do.
The same principle applies to AI verification.
Imagine a subtle logic bug that Claude's training data doesn't cover well — say, a race condition in a specific concurrency pattern, or an authentication bypass that relies on an unusual protocol interaction.
Five Claude instances will likely miss it the same way. Not because any individual agent is bad, but because they share the same systematic blind spot. Running the same model five times doesn't create independence — it creates the illusion of independence.
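The arithmetic behind this is worth making explicit. The sketch below uses illustrative numbers (the per-reviewer miss rate and the correlation model are assumptions, not measured properties of any real system): if each reviewer independently misses a subtle bug with probability p, then n independent reviewers all miss it with probability p^n. But copies of the same model don't fail independently; in the fully correlated limit, they all miss together with probability p, no matter how many copies you run.

```typescript
// Probability that ALL n reviewers miss a bug, assuming each misses
// it independently with probability p.
function missProbIndependent(p: number, n: number): number {
  return Math.pow(p, n);
}

// Simple interpolation between full independence (rho = 0) and
// identical failure modes (rho = 1). A modeling assumption for
// illustration, not a measured property of any review system.
function missProbCorrelated(p: number, n: number, rho: number): number {
  return rho * p + (1 - rho) * Math.pow(p, n);
}

const p = 0.3; // assumed per-reviewer miss rate for a subtle bug

// Five truly independent reviewers: the miss rate collapses.
console.log(missProbIndependent(p, 5).toFixed(5)); // 0.00243

// Five highly correlated copies of one model: barely better than one.
console.log(missProbCorrelated(p, 5, 0.9).toFixed(5)); // 0.27024
```

With highly correlated reviewers, adding more copies buys almost nothing: the shared blind spot dominates.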
Anthropic's own research validates this concern. Their documentation on multi-agent orchestration explicitly describes the "Parallelization/Voting" pattern and notes that true independence requires diversity in the verification pipeline.
At ThoughtProof, we verify outputs using models from different providers — different training data, different architectures, different failure modes.
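As a sketch of that architecture (the `Verifier` interface and provider names below are illustrative assumptions, not the actual ThoughtProof API), the same output fans out to verifiers backed by different providers, and mixed verdicts are surfaced rather than averaged away:

```typescript
// Illustrative cross-provider fan-out. The Verifier interface and
// provider names are assumptions for this sketch, not a real API.
type Verdict = "pass" | "fail";

interface Verifier {
  provider: string; // e.g. "openai", "anthropic", "xai"
  review(output: string): Promise<Verdict>;
}

async function crossVerify(output: string, verifiers: Verifier[]) {
  // Query every provider in parallel with the same output.
  const results = await Promise.all(
    verifiers.map(async (v) => ({
      provider: v.provider,
      verdict: await v.review(output),
    }))
  );
  const fails = results.filter((r) => r.verdict === "fail").length;
  return {
    results,
    // Mixed verdicts across providers mean the output is genuinely
    // uncertain and should be escalated to a human.
    escalate: fails > 0 && fails < results.length,
  };
}
```

The key design choice: disagreement is never resolved silently by a vote. It is returned as a first-class result.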
This isn't a critique of one company. It's a structural problem across the entire industry.
| Company | Product | Multi-Agent | Multi-Model |
|---|---|---|---|
| Anthropic | Code Review | ✅ 5 agents | ❌ All Claude |
| xAI | Grok Agents | ✅ Multiple | ❌ All Grok |
| OpenAI | Various tools | ✅ Multiple | ❌ All GPT |
| ThoughtProof | Verification SDK | ✅ 3+ verifiers | ✅ Different providers |
Every major AI provider builds multi-agent systems using exclusively their own model. This is logical from their perspective — they want to sell their own tokens. Why would Anthropic integrate Grok? Why would xAI call Claude?
But this means none of them can solve common-mode failure. It's not a technical limitation — it's a business model conflict. They are structurally incapable of building model-diverse verification because it works against their commercial interests.
ThoughtProof doesn't have this conflict. We're the neutral verification layer that checks all models against each other.
When GPT disagrees with Claude, and Claude disagrees with Grok, that disagreement is itself a signal. It means the answer is genuinely uncertain — and a human should look.
When five Claudes agree, you don't know if they're right or if they're all wrong in the same way.
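That decision rule is simple to state. The sketch below is hypothetical (`interpret` and `ModelAnswer` are names invented for illustration, not part of any SDK): cross-model consensus raises confidence precisely because the models' failure modes differ, while any cross-model disagreement routes to a human.

```typescript
// Hypothetical decision rule: never resolve cross-model disagreement
// by majority vote; treat it as an escalation signal instead.
interface ModelAnswer {
  model: string; // e.g. "gpt", "claude", "grok"
  answer: string;
}

function interpret(answers: ModelAnswer[]): "consensus" | "escalate" {
  const distinct = new Set(answers.map((a) => a.answer));
  // Agreement across different models is stronger evidence than
  // agreement across copies of one model, because their errors
  // are less correlated. Any split means genuine uncertainty.
  return distinct.size === 1 ? "consensus" : "escalate";
}
```

For example, `interpret([{ model: "gpt", answer: "A" }, { model: "claude", answer: "B" }])` returns `"escalate"`.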
Anthropic built this for code review, but the principle applies everywhere AI outputs matter. Wherever the stakes are high, common-mode failure isn't an inconvenience — it's a liability.
Anthropic validated that multi-agent verification is the future of trustworthy AI. We agree.
But multi-agent isn't enough. Multi-model is what makes verification actually independent.
No major AI provider can solve this problem — because it runs against their business model.
Five copies of the same model isn't consensus. It's an echo chamber.
ThoughtProof is an open-source epistemic verification protocol. Our SDK (pot-sdk) enables multi-model verification for any AI output. Try it:
```bash
npm install pot-sdk
```
Built by a dentist who got tired of trusting single models, and an AI security researcher who kept finding ways to break them.