
A 120-case PLV benchmark — and why reliability is the number that matters

Benchmark · PLV · SERV Reasoning · Banking · MRM
May 11, 2026 · ThoughtProof Engineering · 7 min read

This week we ran a 120-case PLV faithfulness benchmark comparing OpenServ's SERV Reasoning against the prior production cascade we were running for plan-level verification. The headline number — 107× performance per dollar versus baseline — got picked up by OpenServ's marketing this morning.

The headline is real. But it's not the number we think matters most for what we build.

What matters for banking, MRM, and onchain settlement is a different pair of numbers on the same table: 0 false ALLOWs and 0 API failures across all 120 cases. This post walks through the methodology, what we tested, and why that pair of zeros is the production-relevant result — even more than the cost number that travels well on social.

What PLV is, and what we benchmarked

Plan-Level Verification (PLV) is the second of two products we run. The first, pot, verifies individual reasoning claims — the agentic primitive that gives a verdict on a single claim + rationale + evidence triple. PLV operates at a different surface: it takes a complete plan and execution trace from an AI agent and verifies whether the trace is faithful to the plan, evidence-supported, and free of unsupported leaps.
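
In code terms, a PLV call takes roughly this shape. This is a minimal sketch: the endpoint path and field names below are illustrative placeholders, not the documented request schema (that lives at verify.thoughtproof.ai).

```python
import requests

# Illustrative only: endpoint path and field names are placeholders,
# not the documented verify.thoughtproof.ai schema.
payload = {
    "plan": [
        "1. Pull the counterparty's settlement history",
        "2. Check current exposure against the risk limit",
        "3. Approve settlement if exposure is within the limit",
    ],
    "trace": [
        {"step": 1, "action": "query_ledger", "evidence": "settlement rows retrieved"},
        {"step": 2, "action": "compare_exposure", "evidence": "limit $5M, exposure $3.2M"},
        {"step": 3, "action": "approve_settlement", "evidence": "exposure within limit"},
    ],
    "mode": "faithfulness",
}

resp = requests.post("https://verify.thoughtproof.ai/v1/plv", json=payload, timeout=30)
print(resp.json())  # e.g. {"verdict": "ALLOW", ...}; verdicts are ALLOW / BLOCK / HOLD
```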

This is the verification surface enterprise teams ask for in regulated workflows. Banking model risk management (MRM) needs reproducible, auditable verdicts on agent-driven workflows. EU AI Act compliance requires documented evaluation on high-risk systems. Onchain settlement needs deterministic verdicts before money moves. PLV is the layer that delivers that.

Our benchmark suite is 120 cases drawn from production-shape workloads — finance, risk, healthcare, and code domains, with gold labels (ALLOW / BLOCK / HOLD) curated against ground truth. We run it in faithfulness mode: the verifier evaluates whether the reasoning chain is logically coherent and evidence-supported, rather than just checking whether claimed evidence is present.
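
The harness for a suite like this is small. The sketch below assumes a hypothetical verify() callable and a (plan, trace, gold) case format; it is the shape of the scoring loop, not our internal harness.

```python
from collections import Counter

def score_suite(cases, verify):
    """cases: list of (plan, trace, gold) triples; verify: callable returning a
    verdict string. Both are stand-ins for illustration."""
    tallies = Counter()
    for plan, trace, gold in cases:
        try:
            verdict = verify(plan, trace, mode="faithfulness")
        except Exception:
            tallies["api_failure"] += 1   # upstream failure: no verdict was produced
            continue
        if verdict == gold:
            tallies["correct"] += 1
        elif verdict == "ALLOW" and gold in ("BLOCK", "HOLD"):
            tallies["false_allow"] += 1   # the count that must stay at zero
        else:
            tallies["wrong_other"] += 1   # over-blocks, HOLD/BLOCK confusions, etc.
    tallies["accuracy"] = tallies["correct"] / len(cases)
    return tallies
```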

The configurations we compared

We ran three configurations against the same 120-case set, same gold labels, same week:

Prior production cascade: Gemini → Sonnet 4.6. The pipeline we were running before SERV. Frontier-tier accuracy at frontier-tier cost.

SERV-nano solo: Single-model verdict using SERV-nano in faithfulness mode. The cheapest configuration we tested.

Three-layer architecture: SERV-nano as a cost-sensitive prefilter, with escalation to a heavier verifier on UNCERTAIN cases. Our current production default.
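
In code, the three-layer routing reduces to a cheap-first cascade. A minimal sketch, assuming the prefilter can return an explicit UNCERTAIN verdict; the real routing and calibration logic is more involved and, as noted below, not published.

```python
def three_layer_verdict(plan, trace, nano_verify, heavy_verify):
    """Cheap-first cascade. nano_verify and heavy_verify are stand-ins for the
    SERV-nano prefilter and the heavier verifier; the real escalation criteria
    are calibrated, not a single literal flag."""
    verdict = nano_verify(plan, trace, mode="faithfulness")
    if verdict != "UNCERTAIN":
        return verdict                # most cases stop here, at ~$0.0006/call
    return heavy_verify(plan, trace)  # escalation lifts the blended cost to ~$0.018/call
```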

Results

Metric                        Prior cascade          SERV-nano solo    Three-layer
                              (Gemini → Sonnet)
Accuracy                      77.5%                  83.3%             98.1%
Cost per call                 ~$0.06                 ~$0.0006          ~$0.018
False ALLOWs                  0                      0                 0
API failures                  14 of 120 (~12%)       0                 0
Performance per dollar        1× (reference)         107×              4.2×
(accuracy ÷ cost, vs. prior cascade)
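
The performance-per-dollar row is derived directly from the accuracy and cost rows; the arithmetic is reproducible from the table alone:

```python
def perf_per_dollar(accuracy, cost):
    return accuracy / cost

baseline = perf_per_dollar(0.775, 0.06)           # prior cascade (reference)
print(perf_per_dollar(0.833, 0.0006) / baseline)  # SERV-nano solo -> ~107.5x
print(perf_per_dollar(0.981, 0.018) / baseline)   # three-layer    -> ~4.2x
```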

Three things to read out of this.

Cost: SERV-nano is two orders of magnitude cheaper than the prior cascade — at higher accuracy

This is the headline. SERV-nano solo, in faithfulness mode, beats the prior Gemini → Sonnet cascade on accuracy (83.3% vs 77.5%) at 1/100th the per-call cost ($0.0006 vs $0.06). That ratio — 107× accuracy-per-dollar versus baseline — is what OpenServ is publishing.

This number is real and worth the headline. Two orders of magnitude is not an iteration. It's a different cost regime, and it changes what you can afford to verify.

Accuracy: the three-layer cascade lifts the baseline another 15 percentage points

Three-layer (SERV-nano prefilter + escalation) reaches 98.1% accuracy. SERV-nano solo at 83.3% is conservative — when uncertain, it tends to BLOCK. That's safe (it doesn't overclaim), but in production it translates into roughly 17% over-blocks, which create user friction and support burden.

Three-layer escalates the UNCERTAINs to a heavier verifier and recovers most of those over-blocks. The cost goes from $0.0006 to $0.018 per call — still more than 3× cheaper than the prior cascade — and the accuracy lift is 14.8 percentage points.

Three-layer is our production default. SERV-nano solo is a viable cheaper tier for use cases that are tolerant of higher BLOCK rates. Both are in front of customers. The choice is a UX-vs-cost call, not a safety call — because of the next point.

Reliability: 0 false ALLOWs and 0 API failures — and this is the number that matters

Both SERV configurations hit 0 false ALLOWs across all 120 cases. So did the prior cascade. That's the metric that should land hardest in compliance contexts.

In a verifier stack, accuracy-per-dollar is a useful generic indicator. But it implicitly weights false ALLOWs and false BLOCKs equally, and in regulated workflows they are nothing alike. A false ALLOW means "I told you it was safe to settle, and it wasn't." A false BLOCK means "I told you to double-check, and it was fine." A false ALLOW costs orders of magnitude more than a false BLOCK — regulatory consequence versus UX friction. The right metric for a compliance-grade verifier isn't accuracy per dollar; it's cost-per-compliant-call, with FA=0 as a hard constraint and cost minimized inside that constraint.
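
As a selection rule, that metric is easy to state in code. A sketch, using the numbers from the table above:

```python
def cheapest_compliant(configs):
    """Cost-per-compliant-call selection: FA = 0 is a hard constraint;
    cost is minimized only among configurations that satisfy it."""
    compliant = [c for c in configs if c["false_allows"] == 0]
    return min(compliant, key=lambda c: c["cost_per_call"])

configs = [
    {"name": "prior cascade",  "false_allows": 0, "cost_per_call": 0.06},
    {"name": "SERV-nano solo", "false_allows": 0, "cost_per_call": 0.0006},
    {"name": "three-layer",    "false_allows": 0, "cost_per_call": 0.018},
]
print(cheapest_compliant(configs)["name"])  # all three clear FA=0; cost then decides
```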

Under that metric, both SERV configurations clear the bar. The prior cascade also cleared the FA bar — but it failed a different one.

The reliability problem nobody pitches — and why it actually decides production

The same 120 cases ran into something else. The prior cascade hit 14 API failures out of 120 calls — every one of them an upstream 503 on the Gemini layer. That's a 12% failure rate on a verifier stack. SERV ran clean: 0 failures, on the same 120 cases, in the same week.

This number is easy to underrate, because it's framed as an availability metric and most cost/accuracy benchmarks don't include it. But for a verifier deployed in a settlement-gating workflow, it's the failure mode that decides whether the system is production-grade or not.

When an upstream verifier API fails, you have three options. None of them are good:

  1. Retry. Adds latency. On a 5–10 second base call, a retry plus backoff stacks 5–15 seconds onto the affected requests. In an interactive agent loop, that's a UX problem. In a settlement workflow, it's a missed deadline.
  2. Fail open (allow on failure). This is the worst option in a compliance context. It means your audit trail says "ALLOWED" for calls that were never actually verified. That defeats the entire purpose of the verifier.
  3. Fail closed (block on failure). Safe from a compliance standpoint, but users see BLOCKs that aren't really BLOCKs — they're API outages dressed up as verdicts. Support tickets spike. Trust in the verifier degrades.

Every minute of upstream model unreliability turns into architectural overhead in your retry/fallback layer. It also turns into ambiguity in your audit trail. SERV running clean across 120 cases removed that entire branch of complexity — not because we wrote less retry code, but because we didn't need to write any.
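
For concreteness, here is the shape of the fallback branch an unreliable upstream forces you to write. A sketch, with a stand-in UpstreamError; fail-closed is the only compliance-safe default of the three options above.

```python
import time

class UpstreamError(Exception):
    """Stand-in for an upstream verifier failure, e.g. a 503."""

def verify_or_fail_closed(call, retries=2, backoff=2.0):
    for attempt in range(retries + 1):
        try:
            return {"verdict": call(), "verified": True}
        except UpstreamError:
            time.sleep(backoff * (attempt + 1))  # each retry stacks seconds onto the request
    # Fail closed: block, but record it as an outage, never as a real verdict.
    # This keeps the audit trail honest at the cost of user-visible pseudo-BLOCKs.
    return {"verdict": "BLOCK", "verified": False, "reason": "upstream_unavailable"}
```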

For banking and MRM specifically, this is the difference between "audit trail on every transaction" and "audit trail when the API was up." That's a hard regulatory requirement, not a nice-to-have. An evaluator that drops 12% of calls is not a production option in compliance contexts — even if accuracy is identical.

The bigger implication: PLV as an agentic primitive

There's one more thing the cost number unlocks that deserves to be called out directly, because it's not visible in the table.

At $0.0006 per call (SERV-nano solo) or $0.018 per call (three-layer), PLV moves from "enterprise compliance checkpoint" into the same cost tier as a single agent reasoning step. That sounds like a small thing. It isn't.

Previously, PLV was cost-gated to high-value transactions — banking, MRM, regulated workflows. The economics didn't work for putting plan-level verification inside an agent loop, because at $0.06 per call you couldn't afford to verify every plan revision a multi-agent system produces. With SERV in the stack, that constraint goes away.
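
The arithmetic behind that claim is simple (illustrative volumes, not customer data):

```python
# Verification cost for 1,000 plan revisions, an ordinary day's output
# for a busy multi-agent system (illustrative volume).
revisions = 1_000
print(f"prior cascade:  ${revisions * 0.06:.2f}")    # $60.00: a budgeted line item
print(f"SERV-nano solo: ${revisions * 0.0006:.2f}")  # $0.60: noise next to agent inference
print(f"three-layer:    ${revisions * 0.018:.2f}")   # $18.00
```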

This opens a category of use cases that was priced out before:

  - Verifying every plan revision a multi-agent system produces, not just the final plan before execution.
  - Per-step plan verification inside interactive agent loops, where the check has to fit the loop's own latency and cost budget.
  - Treating a PLV call as just another step an agent can afford to take whenever it commits to a plan.

This isn't "cheaper PLV for the same compliance use case." It's PLV becoming deployable as an agentic primitive — a category that was economically infeasible before SERV's cost structure showed up.

Methodology notes

For anyone who wants to reproduce or scrutinize the setup:

  - Suite: 120 cases drawn from production-shape workloads across finance, risk, healthcare, and code, with gold labels (ALLOW / BLOCK / HOLD) curated against ground truth.
  - Mode: faithfulness. The verifier evaluates whether the reasoning chain is logically coherent and evidence-supported, not just whether claimed evidence is present.
  - Configurations: the three described above, run against the same 120-case set, same gold labels, in the same week.
  - Metrics: accuracy against gold labels, per-call cost, false ALLOW count, API failure count, and accuracy-per-dollar normalized to the prior cascade.

What's not in this post. We don't publish per-tier calibration data, routing logic between layers, or model-internal configuration. That's not because the model stack is secret — it's documented in pot and on verify.thoughtproof.ai. It's because the routing and calibration are how we earn the FA=0 result, and publishing them prematurely doesn't help anyone reproduce the methodology — it just makes the verifier easier to game.

What we're doing next

SERV Reasoning is in private beta. The numbers above are from that beta; we'll re-run the full benchmark when SERV reaches general availability and publish a follow-up if the numbers shift materially. We expect them to be stable — the architectural reasons SERV-nano is fast and reliable (purpose-built for verification, not general-purpose chat) don't go away at GA.

The three-layer cascade is now our production default for PLV calls served via verify.thoughtproof.ai. Existing customers using the previous configuration have been migrated transparently — same API contract, same verdict semantics, lower latency, lower cost, lower failure rate. If you're integrating PLV and want to test against the new pipeline, the API is the same; the request you'd have made last week is the request you'd make today.

If you're working on agentic verification — particularly in regulated domains, or in agent-loop deployments where per-step verification has been priced out — we'd like to hear what you're hitting. The category-expansion question (PLV inside agent loops) is the one we're most interested in working through with partners, and it's the one where the SERV economics matter most.

Credit and disclosure. SERV Reasoning is a product of OpenServ, currently in private beta. ThoughtProof is one of the early partners with access. The numbers in this post are drawn from a benchmark we ran on production-shape PLV workloads; OpenServ has authorized publication of the aggregate figures. We're grateful to the OpenServ team — particularly Daniel and Casper — for shipping a verification-grade reasoning model that makes the agentic-primitive use case economically viable in the first place.

Questions on the methodology, or interested in PLV integration? Email us at support@thoughtproof.ai or reach @Raulj1980 on X.