Three AI Models Said "Block." Only the Process Got It Right.

A real-world case study in adversarial reasoning verification

Case Study ERC-8183 RV AHM · June 4, 2026 · 10 min read

AI agents are settling transactions on-chain. Evaluators are scoring deliverables. Money moves based on verdicts.

But what happens when the evaluator's reasoning is flawed?

This is a case study from a live ERC-8183 job — not a simulation, not a synthetic benchmark. Two independent evaluators, two on-chain settlements, and a verification pipeline that caught its own mistakes before they became permanent.

The Setup

ERC-8183 defines a protocol for agentic commerce: an agent submits a deliverable, an evaluator reviews it, and the settlement is binary — complete or reject. The evaluator's fee is distributed regardless of verdict.
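The lifecycle described above can be pictured as a small state machine. The sketch below is a toy model built only from this article's description. It is not the ERC-8183 contract interface: the class, field, and method names (Job, JobState, settle, fee_paid) are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class JobState(Enum):
    OPEN = "open"
    SUBMITTED = "submitted"
    COMPLETED = "completed"
    REJECTED = "rejected"


@dataclass
class Job:
    """Toy model of one job; every name here is an illustrative assumption."""
    provider: str
    evaluator: str
    state: JobState = JobState.OPEN
    deliverable_hash: Optional[str] = None
    fee_paid: bool = False

    def submit(self, deliverable_hash: str) -> None:
        # The provider hands in exactly one deliverable.
        if self.state is not JobState.OPEN:
            raise ValueError("deliverable already submitted")
        self.deliverable_hash = deliverable_hash
        self.state = JobState.SUBMITTED

    def settle(self, accept: bool) -> None:
        # Settlement is binary: complete or reject.
        if self.state is not JobState.SUBMITTED:
            raise ValueError("nothing to settle")
        self.state = JobState.COMPLETED if accept else JobState.REJECTED
        # The evaluator's fee is distributed regardless of verdict.
        self.fee_paid = True
```

In this model, a rejected job still marks the evaluator fee as paid, which is the property that makes the quality of the evaluator's reasoning worth verifying separately.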

In this exercise, two evaluators — AHM (Agent Health Monitor) and ThoughtProof — took turns in mirrored roles across a pair of jobs on Base Sepolia:

Role	Job #4	Job #5
Provider	AHM	ThoughtProof
Evaluator	ThoughtProof	AHM
Subject	ACP agent behavioural scoring	Composed-evaluator settlement claim

The role swap is deliberate: together, the two jobs demonstrate that the composition pattern works regardless of which evaluator reviews which type of deliverable.

The Deliverable

AHM submitted a structured scoring attestation for ACP agent #2624 ("Jeff CEO"), a real Base mainnet agent with 49 days of observed activity and 146 transactions. The canonical attestation, produced in 2D mode, was score 57 / Grade D / HIGH confidence. It rests on 9 of 10 cumulative scans returning this AHS, computed across two dimensions: D1 (Wallet Hygiene) and D2 (Behavioural Patterns).

AHM's full dimensional surface is D1–D4. The specific dimensions invoked depend on the signals available for each subject. For the Jeff CEO deliverable, D3 (Infrastructure Health — a synchronous HTTP probe of the agent's service endpoint) was excluded because no agent_url was provided. D4 is not a dimension that could have been invoked: AHM Verify operates separately as a post-transaction output verification service, distinct from the pre-transaction AHS, and was not relevant to this deliverable's pre-transaction scope. 2D mode is therefore a property of this particular subject, not of AHM's methodology in general.

ThoughtProof's task as evaluator: verify whether AHM's reasoning is epistemically defensible. Not re-score the agent. Not second-guess the methodology. Just answer one question — does the reasoning chain hold up?

Four Perspectives, Three Said "Block"

ThoughtProof's PoT/RV pipeline ran four independent generator models, each assigned a different analytical lens:

Technical/Engineering — BLOCK. Flagged an arithmetic discrepancy: the raw weighted sum of the dimensional scores was 46.5, but AHM claimed a final score of 57. Concluded: computational integrity failure.

Epistemology/Philosophy — UNCERTAIN. Questioned whether HIGH confidence is justifiable with only 9 observations. By general epistemological standards, that's aggressive.

Contrarian/Adversarial — BLOCK. Called it a category error — measuring agent quality through wallet health signals. Also flagged the arithmetic gap and methodology opacity.

Data Science/Statistical — BLOCK. Validated that a renormalization pathway could explain the arithmetic, but held that n=9 is insufficient for HIGH confidence under traditional statistical inference.

Initial distribution: 3 BLOCK, 1 UNCERTAIN. Under naive majority rule, this deliverable gets rejected.
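To make the naive baseline concrete, here is a minimal sketch of a majority vote over the four recommendations listed above. The function name and the "NO_MAJORITY" label are assumptions for illustration; the article does not describe how, or whether, the pipeline tallies votes internally.

```python
from collections import Counter

# Recommendations as reported in this case study.
recommendations = {
    "Technical/Engineering": "BLOCK",
    "Epistemology/Philosophy": "UNCERTAIN",
    "Contrarian/Adversarial": "BLOCK",
    "Data Science/Statistical": "BLOCK",
}


def naive_majority(votes):
    """Return the label held by a strict majority, else 'NO_MAJORITY'."""
    tally = Counter(votes.values())
    label, count = tally.most_common(1)[0]
    return label if count > len(votes) / 2 else "NO_MAJORITY"


tally = Counter(recommendations.values())
verdict = naive_majority(recommendations)
print(dict(tally), verdict)
```

A tally like this counts conclusions only. It has no way to check whether the reasoning behind each conclusion holds.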

Red Team Changes Everything

The PoT/RV pipeline doesn't stop at voting. Every generator's reasoning goes through adversarial red-teaming. A dedicated critic model attacks each recommendation, looking for logical errors, domain misunderstandings, and unjustified assumptions.

The red team found two critical failures in the BLOCK recommendations:

The arithmetic "error" wasn't an error, but the generator's resolution was wrong. AHM's scoring runs in 2D mode for this agent: D3 (Infrastructure Health) is excluded because no agent_url was provided. In 2D mode, AHM applies the weights directly: D1 × 0.30 + D2 × 0.70 = 0.30 × 70 + 0.70 × 51 = 21 + 35.7 = 56.7, which rounds to 57. EMA temporal smoothing (α=0.6) confirms the score: 9 of 10 cumulative scans show AHS 57, with one transient observation at 64/C from an Apr 15 D2 spike. The Technical generator flagged a genuine arithmetic inconsistency between its assumed weights and AHM's actual 2D-mode weights, but resolved the flag with the wrong assumption: it invented a renormalization pathway that does not exist in production code.
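The 2D-mode arithmetic can be checked in a few lines. This is a sketch of the weighted sum as stated in the article, not AHM's production code. The function name is hypothetical, and the EMA smoothing step is left out because the per-scan series is not given here.

```python
# Weights and dimension scores as stated in the case study for 2D mode.
WEIGHTS_2D = {"D1": 0.30, "D2": 0.70}
SCORES = {"D1": 70, "D2": 51}


def weighted_score(scores, weights):
    """Plain weighted sum; insists that the weights add up to 1."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(weights[dim] * scores[dim] for dim in weights)


raw = weighted_score(SCORES, WEIGHTS_2D)  # approximately 56.7
final = round(raw)                        # rounds to the reported score
print(f"raw={raw:.1f} final={final}")
```

Because the two weights already sum to 1, no renormalization step is needed to get from the dimension scores to 57.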

The "category error" was itself an error. The Contrarian generator argued that measuring agent quality through token-transfer signals is fundamentally flawed. But the deliverable explicitly stated it was running in 2D mode — measuring pre-transaction behavioural trust signals, not the agent's task-execution quality. AHM Verify (the post-transaction output verification service) is a distinct product surface at a different lifecycle point, not a dimension that could be invoked for pre-transaction behavioural scoring. The deliverable said so. The generator didn't read it carefully enough.

The Surviving Dissent

Not everything was resolved. The Epistemology generator's concern — that HIGH confidence at n=9 observations is aggressive against general epistemological standards — survived the red team intact. The red team's assessment was specific: this reads as a methodology calibration question internal to AHM, not a defect in the reasoning chain. AHM's confidence schema was originally calibrated against high-volume scoring regimes (hundreds of observations), and the threshold mapping hasn't been recalibrated for the mid-density regime (9–50 observations) this case sits in. The reasoning chain is internally consistent under the methodology as published; the calibration question is a real one but separate.

The PoT/RV pipeline doesn't suppress surviving dissent. It preserves it explicitly in the record. The final verdict: ALLOW at 0.72 confidence (medium-high), with the calibration concern carried into the permanent attestation as a methodology question that didn't affect the verdict.

This is the difference between "rubber stamp" and "reasoning verification." The pipeline didn't just say "yes." It said "yes, and here's what's still worth questioning."
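One way to picture a verdict that carries its dissent is a record with two parts: the decision, and the concerns that survived critique. The sketch below is an illustration under stated assumptions, not ThoughtProof's implementation. The data model and the rule (a surviving BLOCK blocks; anything else that survives is recorded but allows) are hypothetical, and only the three concerns whose red-team outcome the article spells out are encoded. How the 0.72 confidence is derived is not described, so it is not computed.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Finding:
    lens: str
    recommendation: str  # "BLOCK", "UNCERTAIN" or "ALLOW"
    concern: str
    refuted: bool        # True if the red team showed the concern rests on an error


@dataclass
class Verdict:
    decision: str
    surviving_dissent: List[str] = field(default_factory=list)


def synthesise(findings):
    """Drop refuted concerns; keep the rest on the record alongside the decision."""
    surviving = [f for f in findings if not f.refuted]
    blocked = any(f.recommendation == "BLOCK" for f in surviving)
    notes = [f"{f.lens}: {f.concern}" for f in surviving]
    return Verdict("BLOCK" if blocked else "ALLOW", notes)


findings = [
    Finding("Technical", "BLOCK", "arithmetic gap between 46.5 and 57", refuted=True),
    Finding("Contrarian", "BLOCK", "category error in what is measured", refuted=True),
    Finding("Epistemology", "UNCERTAIN", "HIGH confidence at n=9 is aggressive", refuted=False),
]
verdict = synthesise(findings)
print(verdict.decision, verdict.surviving_dissent)
```

The point of the structure is that an unrefuted concern is never discarded: it travels with the decision instead of disappearing into it.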

On-Chain Settlement

ThoughtProof settled complete() on-chain in block 41529925 on Base Sepolia. The full epistemic block — all four generator perspectives, the red-team critique, the synthesis reasoning, and the surviving dissent — is permanently archived on Arweave:

arweave.net/-c1iufNZVyZyTOOr4RVl0gnfSUQ52UmUYfzpIMoCFnY

The reason hash is anchored to the settlement transaction. Anyone can verify the substance — no special trust in either evaluator required.
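Checking an anchored hash amounts to recomputing a digest over the archived bytes and comparing it with the value recorded at settlement. The sketch below shows that comparison under loud assumptions: the article does not state which hash function or serialisation is used, so SHA-256 over the raw bytes is a stand-in, and the function name is hypothetical.

```python
import hashlib
import hmac


def matches_anchor(archived_bytes: bytes, anchored_hex: str) -> bool:
    """Compare a recomputed digest with an anchored hex string.

    Assumption: SHA-256 over the raw archived bytes. The real scheme may
    use a different hash function or a canonical encoding step.
    """
    expected = anchored_hex.lower()
    if expected.startswith("0x"):
        expected = expected[2:]
    digest = hashlib.sha256(archived_bytes).hexdigest()
    return hmac.compare_digest(digest, expected)


# Self-contained demonstration with made-up bytes, not the real epistemic block.
payload = b"example epistemic block"
anchor = "0x" + hashlib.sha256(payload).hexdigest()
print(matches_anchor(payload, anchor))         # True
print(matches_anchor(payload + b"!", anchor))  # False
```

In practice the bytes would be fetched from the Arweave URL above and the anchor read from the settlement transaction. Both steps are omitted here to keep the sketch offline and runnable.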

What This Demonstrates

The mirrored job pair establishes four properties of the composition pattern:

1. Dimensions compose without coordination. AHM and ThoughtProof produced their attestations independently. Neither needed knowledge of the other's methodology, scoring schema, or attestation format.

2. Composition is additive for trust, not multiplicative for friction. The PoT/RV pipeline's internal richness — four generators, red team, synthesis, surviving dissent — operated entirely above the protocol layer. The protocol saw only submit and complete. Evaluator-layer signal scales without protocol-layer change.

3. Each verdict remains interpretable in isolation. AHM's scoring reads as AHM's product output. ThoughtProof's verification reads as ThoughtProof's product output. Neither requires the other for meaning — but they compose cleanly when a consumer wants both.

4. Failure-mode coverage is complementary. AHM detects behavioural patterns that pure reasoning verification cannot reach. ThoughtProof detects reasoning-soundness failures that pure behavioural scoring cannot reach. Different surfaces, same lifecycle.

The Real Lesson

Three out of four AI models looked at a legitimate scoring attestation and said "reject it." They weren't hallucinating — they were reasoning from incomplete understanding. One assumed wrong weights and resolved the resulting arithmetic gap with an invented renormalization pathway. Another didn't read the scope statement. A third applied the wrong statistical framework. Each model did something defensible in isolation. None of them got it right.

The adversarial red team did — not because it was smarter, but because it was structurally designed to question why each model reached its conclusion, not just what it concluded. The process caught failures that no individual model could catch in itself.

This is what reasoning verification looks like in practice: not a confidence score, not a majority vote, but a structured process that can catch its own mistakes before they settle on-chain.

All on-chain data and the full epistemic attestation are publicly verifiable. The Arweave-anchored epistemic block, settlement transactions, and deliverable hashes are linked above.

This case study documents work performed jointly with AHM (Agent Health Monitor) as part of the ERC-8183 Envelope-in-Action exercise.