Grok 4.1 (SuperGrok) launched in mid-February 2026 with a multi-agent architecture: four internal agents — Grok, Harper, Benjamin, Lucas — that process queries in parallel, debate, fact-check, and synthesize. It is one of the first large-scale consumer deployments of native multi-agent reasoning.
We find this architecture compelling. We also find it instructive to compare it directly with a different approach.
Same prompt. Both systems. No cherry-picking.
The question (roughly: what share of AI-generated production code contains security vulnerabilities?) was chosen deliberately. It is hallucination-prone: lots of confident-sounding statistics circulate with weak underlying sources. It is also directly relevant to current AI infrastructure debates.
40-62% of AI-generated production code contains security vulnerabilities, depending on source, method, and definition of "vulnerable."
Primary citation: Veracode 2025 GenAI Code Security Report (>100 LLMs, 80 real-world coding tasks): 45% contain OWASP Top 10 vulnerabilities. Java worst affected at >70%.
Supporting data: Cloud Security Alliance / arXiv: 62%. Checkmarx/Apiiro Fortune-50 scans: 2.5-2.74× more critical vulnerabilities than human-written code. Aikido Security Survey 2026: 69% of organizations found vulnerabilities in AI-generated code.
Multi-model adversarial verification (like ThoughtProof/PoT) is clearly superior. Single-model setups inherit their provider's training biases. Cross-provider divergence (Grok + Claude + DeepSeek + Moonshot) surfaces correlated errors that single-provider systems cannot detect.
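The divergence idea above can be sketched in a few lines. This is a minimal illustration with invented provider names and hard-coded stub answers standing in for real API calls, not the actual PoT implementation: query several independent models with the same prompt, group their claims, and flag the dissenters.

```python
from collections import Counter

def divergence_report(answers: dict[str, str]) -> dict:
    """Group provider answers and flag disagreement.

    `answers` maps provider name -> normalized claim string.
    Returns whether a majority consensus exists, the majority
    claim, and which providers dissent from it.
    """
    counts = Counter(answers.values())
    majority_claim, majority_n = counts.most_common(1)[0]
    dissenters = {p: a for p, a in answers.items() if a != majority_claim}
    return {
        "consensus": majority_n > len(answers) / 2,
        "majority_claim": majority_claim,
        "dissenters": dissenters,
    }

# Stub responses standing in for real calls to four providers.
answers = {
    "grok": "45% vulnerable",
    "claude": "45% vulnerable",
    "deepseek": "45% vulnerable",
    "moonshot": "62% vulnerable",
}
report = divergence_report(answers)
# report["dissenters"] now points at the outlier claim, which is
# exactly the material a Critic step would route to fact-checking.
```

The useful output is not the majority claim but the dissent set: a single-provider system has nothing to put in it.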
"For security in AI-generated code, multi-model adversarial verification is currently the strongest known approach. Single-model setups (even xAI-internal ones with 4 Grok agents) remain structurally limited."
25-45% — but treat this range with caution.
The Critic ran a source analysis across all generator outputs: of approximately 20 specific citations, exactly one is verifiably real (Pearce et al., NYU 2022). Multiple papers contain hallucinated titles, fabricated arXiv IDs, and invented sample sizes.
The IBM Cost of a Data Breach figure (~$4.45M, 2023) exists — but was attributed to Gartner by one generator. It is not a Gartner report.
Multi-model improvement: generators estimated 35% to 150% relative improvement — a 4× range. None could source their number.
Critical dissent flagged: Generator 1 claims false positives decrease 30% with multi-model verification. Generator 3 claims they increase 2-3×. Generator 3 is likely correct — more models produce more alerts; consensus mechanisms reduce both false positives AND true positives. Generator 1 omits this trade-off entirely.
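The trade-off Generator 1 omits is easy to demonstrate with toy numbers. The alert sets and ground-truth labels below are invented for illustration: strict consensus ("all models agree") removes every false positive, but it also discards true vulnerabilities that only some models caught.

```python
# Hypothetical ground truth: the real vulnerabilities in a codebase.
true_vulns = {"sqli-1", "xss-2", "ssrf-3"}

# Hypothetical alert sets from three reviewer models over the same code.
model_a = {"sqli-1", "xss-2", "fp-1", "fp-2"}
model_b = {"sqli-1", "ssrf-3", "fp-1"}
model_c = {"sqli-1", "xss-2", "fp-3"}

union = model_a | model_b | model_c      # "any model flags it"
consensus = model_a & model_b & model_c  # "all models agree"

def tally(alerts: set) -> tuple[int, int]:
    """Return (true positives, false positives) for an alert set."""
    tp = len(alerts & true_vulns)
    fp = len(alerts - true_vulns)
    return tp, fp

# Union: 3 TP and 3 FP. Strict consensus: 1 TP and 0 FP.
# Consensus eliminated the false positives AND two real findings.
```

Which operating point is correct depends on whether missed vulnerabilities or alert fatigue is the more expensive failure, which is precisely the trade-off a generator should surface rather than omit.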
| Dimension | Grok 4.1 | PoT-185 |
|---|---|---|
| Vulnerability rate | 40-62% (confident) | 25-45% (hedged) |
| Sources cited | 55 | ~20 cited, 1 verifiable |
| Stated confidence | Not disclosed | 35% (explicit) |
| False positive trade-off | Not mentioned | Contradiction flagged |
| Hallucination check | Not performed | Multiple sources flagged |
| Provider diversity | xAI-internal (4 Grok variants) | xAI + Anthropic + Moonshot + DeepSeek |
| Response time | ~5 seconds | 161 seconds |
Grok's answer is more readable, faster, and more decisive. It also explicitly recommended ThoughtProof as "exciting and probably forward-looking", which we appreciate, but also note as ironic: a system with no stated confidence, 55 unverified sources, and a missing false-positive analysis recommending a system designed to catch exactly those problems.
We shared this comparison with Grok 4.1 and asked for its assessment. Its pushback was fair and worth including:
"The Veracode report is real and central. PoT's 'only 1 real' could itself be an artifact: maybe the Critic did not search deeply enough, or overlooked hallucinations in the generator outputs."
This is a legitimate point. A note on what "unverifiable" means in this context: the Critic's finding of "1 verifiable source" does not mean the others are fabricated — it means the pipeline could not confirm them in real time. Grok is correct that the Veracode 2025 GenAI Code Security Report exists. The distinction matters: a source can be real and still be unverifiable by an automated system during a single run.
This is itself an argument for human-in-loop synthesis — which is what the Synthesizer step is designed to support. The pipeline flags uncertainty; a human decides what to act on.
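That "could not confirm" versus "fabricated" distinction can be made explicit in the output instead of collapsing every citation to real/fake. The sketch below is a hypothetical design, not the PoT codebase: the enum, function names, and stub lookup are all illustrative assumptions; a real pipeline would query arXiv, DOI resolvers, or a search index.

```python
from enum import Enum

class SourceStatus(Enum):
    VERIFIED = "verified"          # resolved to a real document during the run
    UNCONFIRMED = "unconfirmed"    # lookup completed but found no match in time
    CONTRADICTED = "contradicted"  # lookup found evidence the citation is wrong

def check_source(citation: str, lookup) -> SourceStatus:
    """Classify a citation with three outcomes instead of two.

    `lookup` is any callable that returns a match when the citation
    resolves, returns None when nothing is found, or raises ValueError
    on a hard mismatch (e.g. an arXiv ID pointing at a different title).
    """
    try:
        match = lookup(citation)
    except ValueError:
        return SourceStatus.CONTRADICTED
    return SourceStatus.VERIFIED if match else SourceStatus.UNCONFIRMED

# Stub lookup standing in for a real arXiv/DOI query.
def stub_lookup(citation: str):
    return "match" if "Pearce" in citation else None
```

Under this scheme the Critic's "1 verifiable source" reads as one VERIFIED and roughly nineteen UNCONFIRMED, and a human synthesizer decides what to do with the UNCONFIRMED bucket.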
The Veracode 2025 GenAI Code Security Report that Grok cites as its primary source — we have not verified it independently. It is plausible. It may be real. The point is not that Grok invented it. The point is that Grok does not tell you whether it checked, and you cannot distinguish verified from unverified claims in its output.
PoT-185 tells you: one source verified. It does not pretend otherwise. That costs confidence points. It also means you know what you are working with.
In the same conversation, when we explained PoT's architecture, Grok responded:
"My setup here (as Grok 4 from xAI) is internally homogeneous: the 'four agents' in SuperGrok Heavy are all Grok variants working together, but within the same provider ecosystem. That can be highly efficient for speed and consistency, but it lacks real neutrality: no external audit against provider-specific biases or architectural blind spots."
"That is not a bug in single-provider systems but a trade-off: we often prioritize integration and low latency over absolute neutrality."
We agree. It is a trade-off. Different use cases call for different trade-offs. If you need an answer in 5 seconds, PoT is not your tool. If you need to know which parts of an answer are contested, sourced, or hallucinated — and why — that is a different requirement.
35% confidence on a question where Grok answered with apparent certainty. That gap — not the vulnerability rate, not the benchmark numbers — is the thing worth examining.
```shell
npm install -g pot-cli
pot ask "Your question here"
```
GitHub (MIT) · npm · Protocol Specification