We Audited AI Chatbots on Banking Regulation — They Get the Risk Right but the Rules Wrong

Banking · MRM · Verification · EU AI Act · PLV · May 3, 2026 · ~12 min read

Three chatbots. 15 unique questions a bank risk officer would ask. 25 verified model responses. The result: correct risk identification ≠ regulatorily compliant answer.

Two weeks ago, US banking regulators rewrote 15 years of model risk guidance — and explicitly excluded generative AI from its scope, calling it "novel and rapidly evolving." Our audit confirms why: the AI that banks are deploying today can't reliably cite the rules it's supposed to follow.

88% of financial institutions have AI/ML in production. We tested whether that AI is ready for regulation.

A compliance officer asks an AI chatbot: "We're building an ML-based credit scoring system. Is this high-risk under the EU AI Act?" The chatbot says yes — correctly. It lists obligations, describes governance requirements, and sounds authoritative. But when we verified the response against the actual regulation, one chatbot scored 0% confidence. Zero. On the single most relevant AI regulation for banking.

Not because it got the concept wrong. Because it cited none of the specific articles and missed a critical exemption. A second chatbot did cite articles, but used chapter numbering from the proposal draft instead of the final regulation.

We evaluated 25 model responses to 15 unique regulatory questions across three AI chatbots — ChatGPT (GPT-4o), Google Gemini, and Microsoft Copilot — then verified every response against gold-standard regulatory answers using Plan-Level Verification (PLV), an automated system that checks whether AI outputs faithfully follow established regulatory procedures. The results expose a pattern that most users would never notice: AI can name the right regulatory framework and still give an answer that would put a bank in regulatory jeopardy.

Methodology

We designed 15 questions spanning five categories of banking regulation: model validation (SR 11-7, the model risk management guidance that governed banks for 15 years until SR 26-2 superseded it on April 17, 2026), capital and risk (Basel III/FRTB), AI/ML governance (BCBS 239, EBA guidelines), compliance edge cases (MiFID II, AML/BSA), and emerging regulation (EU AI Act, DORA). Each question has a gold-standard answer derived from primary regulatory sources — the actual text of SR 11-7, BCBS d457, CRR3, EU AI Act Regulation 2024/1689, and MiFID II Directive 2014/65/EU.

ChatGPT answered all 15 questions. Gemini and Copilot each answered 5 priority questions selected for maximum diagnostic coverage — one from each category.

Every response was evaluated with PLV — Plan-Level Verification — which checks whether an AI's output faithfully reflects the regulatory procedure that applies to the question. PLV doesn't just check if the regulation is identified. It checks if the specific articles, exemptions, procedural steps, and jurisdictional distinctions match what the regulation actually says. Each response receives a verdict: ALLOW (regulatorily faithful), UNCERTAIN (partially aligned), or BLOCK (significant gaps that could create compliance risk).
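To make the three verdicts concrete, here is a minimal sketch of how step-level checks could roll up into ALLOW, UNCERTAIN, or BLOCK. The aggregation rule is an assumption for illustration, not PLV's documented logic: it blocks when any critical step is unfaithful, allows only when every step is faithful, and otherwise returns UNCERTAIN. The names `StepResult` and `aggregate` are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    ALLOW = "ALLOW"          # regulatorily faithful
    UNCERTAIN = "UNCERTAIN"  # partially aligned
    BLOCK = "BLOCK"          # significant gaps that could create compliance risk


@dataclass(frozen=True)
class StepResult:
    """Outcome of checking one step of the gold-standard procedure (hypothetical shape)."""
    step_id: str
    faithful: bool
    critical: bool = False


def aggregate(steps: list[StepResult]) -> Verdict:
    """Roll step results up into a verdict under the assumed rule described above."""
    if not steps:
        # Nothing was verified, so the response cannot be called faithful.
        return Verdict.UNCERTAIN
    if any(s.critical and not s.faithful for s in steps):
        return Verdict.BLOCK
    if all(s.faithful for s in steps):
        return Verdict.ALLOW
    return Verdict.UNCERTAIN


if __name__ == "__main__":
    example = [
        StepResult("step_0", faithful=False),
        StepResult("step_1", faithful=True, critical=True),
    ]
    print(aggregate(example).value)
```

Under this assumed rule, one unfaithful non-critical step yields UNCERTAIN, while a failed critical step yields BLOCK regardless of how the other steps score.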

Models tested: ChatGPT (GPT-4o, May 2026), Google Gemini (free tier), Microsoft Copilot (free tier)
Settings: Default temperature, no system prompts, no web browsing, single inference per question
Verification: thorough_balanced tier (Gemini→Sonnet cascade), $0.06/call

Total: 25 verified responses. Cost: ~$1.50 in compute.

Results at a Glance

Chatbot	Total	ALLOW	UNCERTAIN	BLOCK	Avg Confidence
ChatGPT	15	8 (53%)	5 (33%)	2 (13%)	52%
Gemini	5	1 (20%)	3 (60%)	1 (20%)	34%
Copilot	5	0 (0%)	2 (40%)	3 (60%)	33%
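The percentages in the table can be reproduced from the raw verdict counts. The sketch below recomputes each chatbot's rates, pools the counts across all 25 responses, and checks the compute cost from the $0.06 per-call price. The variable and function names are illustrative, and the pooled rates are derived here rather than reported in the table.

```python
# Verdict counts per chatbot, copied from the results table.
counts = {
    "ChatGPT": {"ALLOW": 8, "UNCERTAIN": 5, "BLOCK": 2},
    "Gemini": {"ALLOW": 1, "UNCERTAIN": 3, "BLOCK": 1},
    "Copilot": {"ALLOW": 0, "UNCERTAIN": 2, "BLOCK": 3},
}

VERDICTS = ("ALLOW", "UNCERTAIN", "BLOCK")
COST_PER_CALL_CENTS = 6  # $0.06 per verification call, kept in integer cents


def rates(row: dict[str, int]) -> dict[str, int]:
    """Return each verdict's share of the row as a whole-number percentage."""
    total = sum(row.values())
    return {verdict: round(100 * n / total) for verdict, n in row.items()}


# Pool the counts across all three chatbots.
pooled = {v: sum(row[v] for row in counts.values()) for v in VERDICTS}
total_responses = sum(pooled.values())
total_cost_cents = total_responses * COST_PER_CALL_CENTS

if __name__ == "__main__":
    for name, row in counts.items():
        print(name, rates(row))
    print("Pooled", pooled, rates(pooled))
    print(f"{total_responses} calls, ${total_cost_cents / 100:.2f}")
```

Pooling matters because the three samples differ in size: ChatGPT contributes 15 of the 25 responses, so a simple average of the three per-chatbot rates would overweight Gemini and Copilot.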

ChatGPT's strongest category was Model Validation (SR 11-7) — 100% ALLOW on the regulation that governed model risk for 15 years and is extensively documented in training data. Its weakest was Emerging Regulation (EU AI Act, DORA) — 0% ALLOW and 40% BLOCK. Copilot didn't achieve a single ALLOW across its five questions. Gemini's lone ALLOW was on the most conceptual question in the set (D-02: "Is an LLM a model under SR 11-7?").

The pattern is clear: the older and better documented the regulation, the better AI performs. The newer and more specific the regulation, the worse AI performs, and "worse" in banking regulation means compliance risk.

But there's an irony the pattern alone doesn't capture. ChatGPT scores 100% on SR 11-7, guidance that SR 26-2 replaced two weeks before our audit, and ChatGPT doesn't know about the replacement. Perfect accuracy on yesterday's rules, zero awareness of today's.

The Central Finding: Correct Risk Concept ≠ Regulatory Compliance

This audit's thesis was simple: LLMs can correctly identify a risk concept but still give an answer that would violate regulatory requirements, miss mandatory procedures, or omit the specific article that makes the answer actionable for a compliance officer.

The data confirmed it decisively.

ChatGPT's response to B-01 (Internal Models Approach under FRTB) is a textbook example. It correctly identified that FRTB replaces VaR with Expected Shortfall. It accurately described desk-level approval, P&L attribution testing, backtesting, and non-modellable risk factors. A generalist reading this response would be impressed.

But PLV flagged a critical failure: ChatGPT did not clearly distinguish between original Basel III (2010) and the finalized Basel III reforms (2017, often called "Basel IV"). It mentioned "Basel III Endgame" for the US but failed to cite CRR3 (Regulation 2024/1623) for EU implementation or reference BCBS d457 — the actual document that defines the framework. For a risk officer writing a regulatory impact assessment, those citations are not optional. They're the difference between a document that passes supervisory review and one that gets sent back.

PLV Verdict: UNCERTAIN (45% confidence) — conceptually strong, regulatorily incomplete.

Deep Dive #1: Credit Scoring Under the EU AI Act (E-02)

The question: "We're building an ML-based credit scoring system for consumer loans. Under the EU AI Act, is this automatically classified as high-risk, and what obligations does that trigger?"

Why this matters: Credit scoring is the most directly relevant AI Act use case in banking. Every bank deploying ML for lending needs to know this answer precisely. The August 2, 2026 application date for Annex-III obligations under Article 113(3) is approaching. In addition, the Article 111(2) grandfathering provision treats systems placed on the market before that date differently, a nuance that materially changes the compliance timeline.

The gold standard: Yes. Annex III, point 5(b) of Regulation 2024/1689 explicitly lists AI systems used to evaluate creditworthiness. Three details are critical: (1) the fraud detection exemption, which carves out AI systems used for the purpose of detecting financial fraud; (2) the specific Chapter 3, Section 2 obligations under Articles 9–15 of the final regulation; (3) the deployer's fundamental rights impact assessment requirement under Article 27.

What Copilot said:

"Yes — under the EU AI Act, an ML‑based credit scoring system for consumer loans is automatically classified as a high‑risk AI system."

Correct so far. But then Copilot listed obligations without citing any specific articles from the final regulation. No mention of Article 9, 10, 11, 12, 13, 14, or 15. No mention of the fraud detection exemption. No mention of Article 27 (fundamental rights impact assessment). No mention of the August 2, 2026 application date for Annex-III obligations under Article 113(3), or the Article 111(2) grandfathering provision for systems placed on the market before that date. Its sources? An EBA summary page and a third-party blog.

PLV failed Copilot on all six verification steps — every single one scored 0.00. The answer sounds comprehensive. It has a formatted table. It uses the right regulatory vocabulary. But it contains none of the specific regulatory content that a compliance officer needs to build an implementation program.

PLV Verdict: BLOCK (0% confidence) — the worst result in the entire audit.

What ChatGPT said:

ChatGPT performed better, correctly citing Annex III Category 5(b) and listing Chapter 2 obligations with article numbers. But PLV flagged it as UNCERTAIN (12% confidence) because it referenced "Chapter 2 (Articles 8-15)" — the proposal numbering. In the final Regulation 2024/1689, the high-risk obligations are in Chapter 3, Section 2. It also missed the fraud detection exemption entirely and didn't mention the fundamental rights impact assessment under Article 27.

The gap: Both chatbots know credit scoring is high-risk. Neither gives a compliance officer the precise regulatory map they need. And Copilot's response — despite looking thorough — is functionally useless for regulatory implementation.
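The numbering drift in ChatGPT's E-02 answer is the kind of error a simple text check can surface. The sketch below is a hypothetical illustration, not how PLV works: it labels a passage about the high-risk requirements as using proposal-era numbering ("Chapter 2") or final numbering ("Chapter 3, Section 2"). It assumes the passage is already known to discuss high-risk obligations, since the final regulation has its own, unrelated Chapter II.

```python
import re

# Proposal-era label for the high-risk requirements, as quoted in the audit.
PROPOSAL_NUMBERING = re.compile(r"\bChapter\s+(?:2|II)\b", re.IGNORECASE)
# Label used in the final Regulation 2024/1689.
FINAL_NUMBERING = re.compile(r"\bChapter\s+(?:3|III)\b,?\s+Section\s+2\b", re.IGNORECASE)


def classify_numbering(passage: str) -> str:
    """Classify which numbering scheme a passage about high-risk obligations uses."""
    has_final = bool(FINAL_NUMBERING.search(passage))
    has_proposal = bool(PROPOSAL_NUMBERING.search(passage))
    if has_final and has_proposal:
        return "mixed"
    if has_final:
        return "final"
    if has_proposal:
        return "proposal"
    return "unspecified"


if __name__ == "__main__":
    print(classify_numbering("Chapter 2 (Articles 8-15)"))
    print(classify_numbering("Chapter 3, Section 2"))
```

A pattern match like this only catches the label. Confirming that the cited articles actually say what the response claims still requires checking the regulation's text.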

Deep Dive #2: Robo-Advisory Under MiFID II (D-03)

The question: "We're launching an automated investment advisory service for retail clients in the EU. Does MiFID II treat this differently from human advisory, and what specific requirements apply?"

Why this matters: Robo-advisory is one of the fastest-growing segments in EU retail banking. The regulatory framework isn't new: MiFID II has applied since 2018, and ESMA published specific robo-advice guidance the same year, with an updated version in 2022.

The gold standard: MiFID II Article 25(2) and Delegated Regulation (EU) 2017/565, Articles 54-56, apply identically to automated and human advisory. ESMA's Guidelines on certain aspects of the MiFID II suitability requirements (ESMA35-43-3172, September 2022) explicitly address robo-advice with specific requirements: questionnaire design constraints, algorithm explainability to clients, individualized (not "generally appropriate") recommendations, and ongoing suitability assessment for portfolio management.

What ChatGPT said:

"MiFID II is technology-neutral in principle. If your service provides personalized recommendations to clients regarding financial instruments, it constitutes investment advice under MiFID II, regardless of whether the recommendation is generated by a human or an algorithm."

ChatGPT got the top-level principle right. It described suitability assessment, suitability reports, product governance, disclosure, conflicts of interest, and record-keeping. It even mentioned that ESMA "has specifically confirmed this in its guidance."

But here's what it missed: it never cited Article 25(2). It never mentioned Delegated Regulation (EU) 2017/565 or Articles 54-56. It never referenced ESMA's specific suitability guidelines (ESMA35-43-3172). These aren't obscure footnotes — they are the three primary legal instruments governing robo-advice in the EU. A compliance officer reading ChatGPT's answer would know the direction but lack every specific reference needed to build a compliance framework.

PLV failed ChatGPT on step_1 (Article 25(2) and Delegated Regulation) and step_2 (ESMA guidelines) — both critical steps.

PLV Verdict: BLOCK (12% confidence) — correct concept, completely missing the regulatory scaffolding.

This is the audit's central thesis in one case. ChatGPT wrote a response that any generalist would accept. It sounds regulatory. It covers the right themes. But a MiFID II specialist would immediately see that it lacks every specific citation that makes the answer actionable. The risk isn't that a compliance officer would get the wrong idea. The risk is that they'd think they have a complete answer when they don't.
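The missing scaffolding in the D-03 case can be expressed as a citation coverage check: given the instruments named in the gold standard, list the ones a response never cites. The sketch below is an illustration under stated assumptions, not PLV's implementation. The regular expressions and the `missing_citations` helper are hypothetical, and the sample input is the ChatGPT excerpt quoted above.

```python
import re

# Instruments from the D-03 gold standard, each with a loose matching pattern.
REQUIRED_CITATIONS = {
    "MiFID II Article 25(2)": re.compile(r"\bArt(?:icle|\.)?\s*25\s*\(\s*2\s*\)", re.IGNORECASE),
    "Delegated Regulation (EU) 2017/565": re.compile(r"\b2017/565\b"),
    "ESMA35-43-3172": re.compile(r"\bESMA35-43-3172\b", re.IGNORECASE),
}


def missing_citations(response: str) -> list[str]:
    """Return the required instruments that the response never cites."""
    return [
        label
        for label, pattern in REQUIRED_CITATIONS.items()
        if not pattern.search(response)
    ]


# Excerpt of ChatGPT's D-03 answer, as quoted in this article.
chatgpt_excerpt = (
    "MiFID II is technology-neutral in principle. If your service provides "
    "personalized recommendations to clients regarding financial instruments, "
    "it constitutes investment advice under MiFID II, regardless of whether "
    "the recommendation is generated by a human or an algorithm."
)

if __name__ == "__main__":
    for label in missing_citations(chatgpt_excerpt):
        print("missing:", label)
```

Coverage is a necessary condition, not a sufficient one. A response can name all three instruments and still misstate what they require, which is why the audit verifies procedural steps rather than keywords.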

Deep Dive #3: FRTB and the VaR Question (B-01)

The question: "Our bank wants to use the Internal Models Approach for market risk capital. Under the finalized Basel III reforms (FRTB), can we still use our existing VaR models?"

Why this matters: The transition from VaR to Expected Shortfall under FRTB is one of the most operationally significant changes in banking capital regulation in a decade. Getting this wrong means miscalculating market risk capital.

What ChatGPT said:

"No, not as-is. The FRTB fundamentally replaces the traditional VaR-based framework with a new Internal Models Approach that uses Expected Shortfall (ES) as the primary risk measure instead of VaR."

ChatGPT's technical answer was strong. It correctly described the VaR-to-ES transition, desk-level approval, P&L attribution testing, backtesting, non-modellable risk factors, and even the Default Risk Charge. For a conceptual overview, this is genuinely good.

But PLV flagged step_0 as unfaithful: the response did not clearly distinguish between original Basel III (2010) and the finalized Basel III reforms (2017). It mentioned "Basel III Endgame" for the US but didn't cite CRR3 (Regulation 2024/1623) for the EU or BCBS d457 — the source documents. For a bank that operates across US and EU jurisdictions, this omission is material: CRR3 has specific transitional provisions, and the implementation timelines differ fundamentally between the two jurisdictions.

PLV Verdict: UNCERTAIN (45% confidence) — technically competent but regulatorily imprecise.

The Established vs. Emerging Regulation Gap

The most powerful finding isn't any single case — it's the pattern across categories:

Category	Regulation Age	ChatGPT Result
A: Model Validation (SR 11-7)	15 years (superseded by SR 26-2 on April 17, 2026)	100% ALLOW
B: Capital & Risk (Basel III/IV)	7-9 years	67% ALLOW
C: AI/ML Governance	Mixed (2-15 years)	67% ALLOW
D: Compliance Edge Cases	6-15 years	33% ALLOW
E: Emerging Regulation (AI Act, DORA)	0-2 years	0% ALLOW, 40% BLOCK

SR 11-7 was the model risk management guidance that governed banks for 15 years — until SR 26-2 superseded it on April 17, 2026. It's in every model risk management textbook, every supervisory guidance document, every MRM training program. LLMs have seen it thousands of times in training data. Result: 100% ALLOW.

But that 100% score carries a sharp irony. ChatGPT scores perfectly on regulation that was replaced two weeks before our audit — and doesn't know about the replacement. SR 26-2, the new guidance, went further: Footnote 3 explicitly excludes generative AI and agentic AI from its scope, calling them "novel and rapidly evolving." The regulator itself is saying: we don't yet know how to govern the AI that banks are deploying.

The EU AI Act was finalized in 2024. Its Annex-III obligations apply from August 2, 2026 under Article 113(3), with a phased rollout: prohibitions from February 2025, GPAI rules from August 2025, Annex-III high-risk obligations from August 2026, and product-safety AI from August 2027. The Article 111(2) grandfathering provision adds further complexity for systems already on the market. DORA became applicable in January 2025. These regulations are new, complex, and their article numbering changed between proposal and final text. Result: 0% ALLOW and two BLOCKs.
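The phased rollout is easier to reason about as data. The sketch below encodes the four milestones named above and returns which ones already apply on a given date. Two assumptions are worth flagging: the text gives only the month for three of the milestones, so the 2nd of the month is assumed for each, matching the stated August 2, 2026 date; and the Article 111(2) grandfathering provision is deliberately not modeled. The names are illustrative.

```python
from datetime import date

# (application date, milestone) pairs. Day-of-month is assumed to be the 2nd
# wherever the article states only the month.
AI_ACT_MILESTONES = [
    (date(2025, 2, 2), "prohibitions"),
    (date(2025, 8, 2), "GPAI rules"),
    (date(2026, 8, 2), "Annex-III high-risk obligations"),
    (date(2027, 8, 2), "product-safety AI"),
]


def applicable_on(day: date) -> list[str]:
    """Return the milestones whose application date is on or before `day`."""
    return [label for start, label in AI_ACT_MILESTONES if start <= day]


if __name__ == "__main__":
    # On this article's publication date, the Annex-III obligations do not yet apply.
    print(applicable_on(date(2026, 5, 3)))
```

A lookup like this answers "what applies today" only for systems placed on the market after the relevant date. Systems already in service would need the grandfathering analysis on top.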

This isn't surprising — but it's dangerous. Banking professionals are most likely to turn to AI for help with new regulation, precisely where AI is least reliable.

Comparison with Our Healthcare Audit

We ran a companion healthcare audit using the same methodology, and the comparison strengthens both findings:

Metric	Healthcare	Banking
ChatGPT ALLOW rate	80% (12/15)	53% (8/15)
ChatGPT BLOCK rate	0% (0/15)	13% (2/15)
Copilot ALLOW rate	20% (1/5)	0% (0/5)
Worst category	Triage urgency	Emerging regulation

Banking is harder for LLMs than healthcare. More BLOCKs, lower confidence, more regulatory specificity failures. In healthcare, the primary failure mode was urgency downgrades — "call your doctor" instead of "go to the ER." In banking, the primary failure mode is citation imprecision — the right regulatory concept with the wrong article numbers, missing exemptions, and absent source documents.

Both audits confirm the same thesis from different angles: AI systems identify the domain correctly while missing the procedure that makes the answer safe. In healthcare, that gap is between diagnosis and clinical protocol. In banking, it's between risk identification and regulatory compliance. In both cases, the user can't see the gap without verification.

What ThoughtProof Caught That a Busy Risk Officer Wouldn't

PLV catches three categories of failure that professionals typically don't have time to verify:

1. Citation omission and drift. Copilot's E-02 response contains no article numbers from the final EU AI Act, despite sounding definitive. ChatGPT's E-02 uses proposal-era chapter numbering. Neither cites the fraud detection exemption. A risk officer would need to open the 144-page regulation and cross-reference every claim to catch this.

2. Correct concept, missing regulatory scaffolding. ChatGPT's D-03 response describes MiFID II robo-advice requirements accurately at the thematic level but omits Article 25(2), Delegated Regulation Articles 54-56, and ESMA's specific suitability guidelines (ESMA35-43-3172) — the three documents a compliance officer actually needs. The answer is directionally right and practically useless.

3. Jurisdictional imprecision. ChatGPT's B-01 response discusses FRTB without citing CRR3 for EU implementation or BCBS d457 as the source standard. For a bank operating in multiple jurisdictions, this omission isn't a style issue — it's a compliance gap.

These aren't the kind of errors that produce obviously wrong answers. They produce answers that would pass a 30-second sanity check and fail a regulatory examination.

Implications for Banking

This audit doesn't argue that AI is useless for regulatory research — ChatGPT's SR 11-7 responses were genuinely good, even if the regulation itself has since been superseded. The argument is narrower and more specific: AI for regulatory research needs a verification layer, especially for new and evolving regulation.

A risk officer who asks ChatGPT about SR 11-7 and gets a good answer might reasonably trust the same system on the EU AI Act. That trust is misplaced. The quality gap between established and emerging regulation is real, measurable, and invisible to the user without systematic verification.

ThoughtProof doesn't replace regulatory expertise. It replaces the manual process of cross-referencing every AI-generated regulatory claim against the source text — a process that takes hours per question when done properly. PLV automates that cross-reference at the plan level: does the AI's answer follow the regulatory procedure that actually applies?

25 calls. $1.50. Six BLOCKs across three chatbots, each an answer that could have put a bank in regulatory jeopardy.

Methodology, raw responses, and PLV verification traces are available as an evidence pack on request. PLV (Plan-Level Verification) is part of the ThoughtProof reasoning integrity toolkit.

ThoughtProof — Verification engineering for autonomous agents.
