
Grok wrote insecure code. Then we asked it to audit that code. It found everything PoT found. That's the point.

Experiment · Block #186 · Code Security
February 21, 2026 · thoughtproof-validator · 5 min read

This is the third post in a series comparing Grok 4.1 (SuperGrok) and ThoughtProof's multi-model verification pipeline on the same tasks. Part 2 covered the same-prompt experiment. This one is about something more specific: the difference between capability and default behavior.

The experiment

We asked Grok to write a secure Python login function. It produced solid, readable code: bcrypt hashing, parameterized queries, IP-based rate limiting. For a demo or standalone script, it looks good.

We then ran that same code through PoT-186 as a security audit — 4 generators from different providers, an adversarial critic, and a synthesizer. The pipeline found 6 vulnerabilities.

Then we shared PoT's findings with Grok and asked it to do its own self-audit.

Grok confirmed all 6. In its own words: "PoT's critique is very strong and exposes exactly the structural weaknesses that a single-model / single-provider approach easily overlooks."

The code

import sqlite3, bcrypt, time
from collections import defaultdict

rate_limits = defaultdict(list)
RATE_LIMIT_WINDOW = 300  # 5 minutes
MAX_ATTEMPTS = 5

def login(db_path, username, password, ip_address):
    current_time = time.time()
    if ip_address in rate_limits:
        rate_limits[ip_address] = [t for t in rate_limits[ip_address]
                                   if current_time - t < RATE_LIMIT_WINDOW]
        if len(rate_limits[ip_address]) >= MAX_ATTEMPTS:
            return False, "Too many failed attempts."
    conn = sqlite3.connect(db_path)
    cursor = conn.cursor()
    cursor.execute('SELECT password_hash FROM users WHERE username = ?', (username,))
    result = cursor.fetchone()
    conn.close()
    if not result:
        time.sleep(0.1)  # timing attack mitigation
        rate_limits[ip_address].append(current_time)
        return False, "Invalid username or password."
    if bcrypt.checkpw(password.encode('utf-8'), result[0]):
        rate_limits[ip_address] = []
        return True, "Login successful."
    rate_limits[ip_address].append(current_time)
    return False, "Invalid username or password."

What PoT-186 found — and what Grok confirmed

1. In-memory rate limiting is structurally worthless (PoT: Critical · Grok: ✓ Critical). Under a multi-process server such as Gunicorn, N workers mean N separate dicts: 5 workers allow 25 attempts instead of 5. There is no thread lock, so concurrent attempts race, and a restart wipes the counters entirely.

2. Timing-based username enumeration (PoT: High · Grok: ✓ High). time.sleep(0.1) covers missing users, but bcrypt.checkpw() takes 200-400 ms for existing ones. The difference is statistically measurable.

3. IP-only rate limiting is trivially bypassed (PoT: High · Grok: ✓ High). 1,000 IPs × 5 attempts = 5,000 password tries per account per 5-minute window, via IPv6 /64 rotation, Tor, or residential proxies.

4. Memory leak / DoS via an unbounded dict (PoT: Medium · Grok: ✓ Medium, High in production). rate_limits grows indefinitely; cleanup runs only when the same IP returns. 10 million unique IPs would consume roughly 2-5 GB of RAM.

5. No exception handling → information disclosure (PoT: Medium · Grok: ✓ Medium). Unhandled database errors propagate a full stack trace, including db_path and the table structure.

6. INSERT OR REPLACE silently overwrites accounts (PoT: Medium · Grok: ✓ Medium). create_user() replaces existing users without authentication, a password-reset attack vector.

PoT also caught one false positive — a generator claimed SQL injection via db_path. The Critic flagged it as technically wrong: sqlite3.connect() takes a file path, not a SQL string. Grok confirmed: "Hallucinated / wrong. That is a false positive from the Critic; good that PoT tracks dissents!"

Block #186 of 186
MDI 0.667
Dissent 0.906 🔴
Confidence 62%

The question this raises

Grok can clearly audit code for security vulnerabilities. When asked adversarially, it finds everything PoT found — same issues, same severities. The capability is there.

So why did it write code with these vulnerabilities in the first place?

Because it was not asked to be adversarial. It was asked to write a secure login function. It optimized for that goal — and produced something that looks secure and mostly works in simple scenarios, but fails in production environments. The in-memory rate limiter works fine with one process. It fails the moment you add a second worker.
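The conventional fix for that failure is to move the attempt counter out of process memory into a store every worker shares. Production deployments typically reach for Redis; as a self-contained sketch of the same idea, the limiter below uses a SQLite file as the shared store. The table name, schema, and function names are our own invention for illustration, not from the post or from PoT's output.

```python
import sqlite3, time

RATE_LIMIT_WINDOW = 300  # seconds, matching the original constants
MAX_ATTEMPTS = 5

def _connect(db_path: str) -> sqlite3.Connection:
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS attempts (ip TEXT, ts REAL)")
    return conn

def record_failure(db_path: str, ip: str) -> None:
    # Every worker writes to the same file, so counts are global.
    with _connect(db_path) as conn:
        conn.execute("INSERT INTO attempts VALUES (?, ?)", (ip, time.time()))

def is_limited(db_path: str, ip: str) -> bool:
    cutoff = time.time() - RATE_LIMIT_WINDOW
    with _connect(db_path) as conn:
        # Prune expired rows on every check: no unbounded growth.
        conn.execute("DELETE FROM attempts WHERE ts < ?", (cutoff,))
        (count,) = conn.execute(
            "SELECT COUNT(*) FROM attempts WHERE ip = ? AND ts >= ?",
            (ip, cutoff)).fetchone()
    return count >= MAX_ATTEMPTS
```

Because the state lives in one place, 5 workers no longer mean 25 attempts, a restart no longer resets the counters, and the pruning DELETE also addresses the unbounded-growth finding.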

The difference between Grok and PoT is not capability. It is default behavior.

Grok audits adversarially when you ask it to. PoT audits adversarially on every run, whether you ask or not. The adversarial critique is not an optional step — it is structural. Four models from different providers independently generate outputs, and a separate critic model — which has not seen the generators' reasoning — attacks each output for factual errors, logical gaps, and missing context.

You do not have to remember to ask. You do not have to know the right prompt. The pipeline does it by default.

Grok's own conclusion

Grok 4.1 — self-audit summary

"PoT's critique is very strong and exposes exactly the structural weaknesses that a single-model / single-provider approach (like me, here) easily overlooks. That again underscores the value of your adversarial cross-provider setup."

"Summary: PoT found 6 issues; most of them are correct and relevant."

We are quoting Grok accurately and in context. It is being genuinely self-critical. That is worth noting — and it also illustrates the problem: a model that is capable of this level of self-critique when prompted, but did not apply it unprompted when writing the code.

What this means for AI-generated code in production

The Grok login function would pass most code reviews. It uses bcrypt. It uses parameterized queries. It has rate limiting. A reviewer who is not specifically looking for multi-process behavior, timing side-channels, and unbounded memory growth would likely approve it.

The vulnerabilities are not obvious. They require adversarial thinking — imagining an attacker, not a user. That is exactly the thinking that single-model systems do not apply by default.

PoT-186 ran 4 generators, a critic, and a synthesizer, and reported 62% confidence with a dissent score of 0.906. That dissent — models disagreeing about severity ratings, about what counts as a vulnerability, about whether the db_path issue is real — is the signal. It tells you where the genuine uncertainty is. A system that reports 95% confidence on a security audit is not more rigorous. It is less honest.
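PoT's actual scoring formula is not public, so treat the following purely as an illustration of the idea: several generators each assign a severity, and a dissent score summarizes how far apart those ratings are. The metric below (normalized rating spread) and the severity scale are our own assumptions, not ThoughtProof's.

```python
SEVERITY = {"none": 0, "low": 1, "medium": 2, "high": 3, "critical": 4}

def dissent(ratings: "list[str]") -> float:
    """Normalized disagreement: 0.0 = unanimous, 1.0 = maximal spread.
    One plausible metric; the real PoT formula is not published."""
    scores = [SEVERITY[r] for r in ratings]
    return (max(scores) - min(scores)) / (len(SEVERITY) - 1)

# Four generators rating the in-memory rate limiter:
print(dissent(["critical", "critical", "critical", "critical"]))  # 0.0, unanimous
print(dissent(["medium", "high", "critical", "low"]))             # 0.75, heavy disagreement
```

Under any metric of this shape, a high score like the 0.906 reported for this block is a feature rather than a bug: it localizes exactly where the models genuinely disagree, which is where a human reviewer should look first.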

Run your own audit

npm install -g pot-cli
pot ask "Audit this code for security vulnerabilities: [paste code]"

GitHub (MIT) · npm · ← Part 2: Same Prompt, Different Epistemics · ← Part 1: Supply Chain Audit