This week we ran a 120-case PLV faithfulness benchmark comparing OpenServ's SERV Reasoning against the prior production cascade we were running for plan-level verification. The headline number — 107× performance per dollar versus baseline — got picked up by OpenServ's marketing this morning.
The headline is real. But it's not the number we think matters most for what we build.
What matters for banking, MRM, and onchain settlement is a different number on the same table: 0 false ALLOWs, and 0 API failures across all 120 cases. This post walks through the methodology, what we tested, and why that pair of zeros is the production-relevant result — even more than the cost number that travels well on social.
Plan-Level Verification (PLV) is the second of two products we run. The first, pot, verifies individual reasoning claims — the agentic primitive that gives a verdict on a single claim + rationale + evidence triple. PLV operates at a different surface: it takes a complete plan and execution trace from an AI agent and verifies whether the trace is faithful to the plan, evidence-supported, and free of unsupported leaps.
This is the verification surface enterprise teams ask for in regulated workflows. Banking model risk management (MRM) needs reproducible, auditable verdicts on agent-driven workflows. EU AI Act compliance requires documented evaluation on high-risk systems. Onchain settlement needs deterministic verdicts before money moves. PLV is the layer that delivers that.
Our benchmark suite is 120 cases drawn from production-shape workloads — finance, risk, healthcare, and code domains, with gold labels (ALLOW / BLOCK / HOLD) curated against ground truth. We run it in faithfulness mode: the verifier evaluates whether the reasoning chain is logically coherent and evidence-supported, rather than just checking whether claimed evidence is present.
We ran three configurations against the same 120-case set, same gold labels, same week:
| Configuration | What it is |
|---|---|
| Prior production cascade | Gemini → Sonnet 4.6. The pipeline we were running before SERV. Frontier-tier accuracy at frontier-tier cost. |
| SERV-nano solo | Single-model verdict using SERV-nano in faithfulness mode. The cheapest configuration we tested. |
| Three-layer architecture | SERV-nano as cost-sensitive prefilter, with escalation to a heavier verifier on UNCERTAIN cases. Our current production default. |
| Metric | Prior cascade (Gemini → Sonnet) | SERV-nano solo | Three-layer |
|---|---|---|---|
| Accuracy | 77.5% | 83.3% | 98.1% |
| Cost per call | ~$0.06 | ~$0.0006 | ~$0.018 |
| False ALLOWs | 0 | 0 | 0 |
| API failures | 14 of 120 (~12%) | 0 | 0 |
| Performance per dollar (accuracy ÷ cost, vs. prior cascade) | 1× (reference) | 107× | 4.2× |
Three things to read out of this.
First, the cost number. SERV-nano solo, in faithfulness mode, beats the prior Gemini → Sonnet cascade on accuracy (83.3% vs. 77.5%) at 1/100th the per-call cost ($0.0006 vs. $0.06). That ratio — 107× accuracy-per-dollar versus baseline — is what OpenServ is publishing.
This number is real and worth the headline. Two orders of magnitude is not an iteration. It's a different cost regime, and it changes what you can afford to verify.
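The derived column follows directly from the accuracy and cost figures in the table. A quick check of the arithmetic (the helper function below is ours, for illustration, not part of the benchmark harness):

```python
def perf_per_dollar(accuracy: float, cost_per_call: float) -> float:
    """Accuracy points delivered per dollar spent on a single call."""
    return accuracy / cost_per_call

baseline = perf_per_dollar(77.5, 0.06)    # prior Gemini -> Sonnet cascade
nano     = perf_per_dollar(83.3, 0.0006)  # SERV-nano solo
layered  = perf_per_dollar(98.1, 0.018)   # three-layer architecture

print(round(nano / baseline))        # -> 107
print(round(layered / baseline, 1))  # -> 4.2
```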
Second, the over-block trade-off. Three-layer (SERV-nano prefilter + escalation) reaches 98.1% accuracy. SERV-nano solo at 83.3% is conservative: when uncertain, it tends to BLOCK. That's safe (it doesn't overclaim), but in production it translates into roughly 17% over-blocks, which create user friction and support burden.
Three-layer escalates the UNCERTAINs to a heavier verifier and recovers most of those over-blocks. The cost goes from $0.0006 to $0.018 per call — still more than 3× cheaper than the prior cascade — and the accuracy lift is 14.8 percentage points.
Three-layer is our production default. SERV-nano solo is a viable cheaper tier for use cases that are tolerant of higher BLOCK rates. Both are in front of customers. The choice is a UX-vs-cost call, not a safety call — because of the next point.
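The escalation pattern behind the three-layer configuration can be sketched in a few lines. This is a minimal illustration of the routing logic only: the `Verdict` enum, the UNCERTAIN escalation signal, and the callable shapes are our assumptions, not OpenServ's actual interface.

```python
from enum import Enum

class Verdict(Enum):
    ALLOW = "ALLOW"
    BLOCK = "BLOCK"
    HOLD = "HOLD"
    UNCERTAIN = "UNCERTAIN"  # internal escalation signal from the prefilter

def verify_plan(trace, prefilter, heavy_verifier) -> Verdict:
    """Three-layer pattern: cheap prefilter first, escalate only on UNCERTAIN.

    `prefilter` and `heavy_verifier` are assumed callables returning a
    Verdict; their shapes are illustrative, not the production API.
    """
    verdict = prefilter(trace)           # e.g. SERV-nano in faithfulness mode
    if verdict is Verdict.UNCERTAIN:
        verdict = heavy_verifier(trace)  # heavier model recovers over-blocks
    return verdict
```

Because escalation only fires on the uncertain slice, the blended cost stays close to the prefilter's price while accuracy approaches the heavy verifier's — which is why the three-layer row lands at $0.018 rather than frontier-tier cost.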
Third, the zeros. Both SERV configurations hit 0 false ALLOWs across all 120 cases. So did the prior cascade. That's the metric that should land hardest in compliance contexts.
In a verifier stack, accuracy-per-dollar is a useful generic indicator. But it linearly weights false ALLOWs and false BLOCKs, and they aren't equally weighted in regulated workflows. A false ALLOW means "I told you it was safe to settle, and it wasn't." A false BLOCK means "I told you to double-check, and it was fine." A false ALLOW costs orders of magnitude more than a false BLOCK — regulatory consequence versus UX friction. The right metric for a compliance-grade verifier isn't accuracy per dollar; it's cost-per-compliant-call, with FA=0 as a hard constraint and cost minimized inside that constraint.
Under that metric, both SERV configurations clear the bar. The prior cascade also cleared the FA bar — but it failed a different one.
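That selection rule can be stated concretely: treat zero false ALLOWs (and, per the next section, zero API failures) as hard filters, then minimize cost among the survivors. The figures come from the table above; the selection logic itself is a sketch of the metric, not production code.

```python
configs = [
    # (name, false_allows, api_failures, cost_per_call_usd)
    ("prior cascade",  0, 14, 0.06),
    ("SERV-nano solo", 0,  0, 0.0006),
    ("three-layer",    0,  0, 0.018),
]

# Hard constraints first: zero false ALLOWs, zero API failures.
compliant = [c for c in configs if c[1] == 0 and c[2] == 0]

# Then minimize cost inside the constraint.
cheapest = min(compliant, key=lambda c: c[3])
print(cheapest[0])  # -> SERV-nano solo
```

Note that accuracy never appears in the objective — under this metric, the residual accuracy gap is a UX cost (over-blocks), not a compliance cost, which is exactly the framing of the paragraph above.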
The same 120 cases surfaced something else. The prior cascade hit 14 API failures out of 120 calls — every one of them an upstream 503 on the Gemini layer. That's a ~12% failure rate on a verifier stack. SERV ran clean: 0 failures, on the same 120 cases, in the same week.
This number is easy to underrate, because it's framed as an availability metric and most cost/accuracy benchmarks don't include it. But for a verifier deployed in a settlement-gating workflow, it's the failure mode that decides whether the system is production-grade or not.
When an upstream verifier API fails, you have three options, and none of them are good:

- Fail open: pass the call through unverified. In a settlement-gating workflow, this defeats the purpose of the verifier.
- Fail closed: block or hold everything until the upstream recovers. Safe, but it turns a third-party outage into your outage.
- Retry and queue: hold the call, back off, and retry. This adds latency, retry code, and ambiguity about what the audit trail should record.
Every minute of upstream model unreliability turns into architectural overhead in your retry/fallback layer. It also turns into ambiguity in your audit trail. SERV running clean across 120 cases removed that entire branch of complexity — not because we wrote less retry code, but because we didn't need to write any.
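For a sense of the scaffolding a 12% upstream failure rate forces, here is a sketch of a fail-closed retry wrapper. The `call_verifier` callable and the HOLD-on-exhaustion policy are assumptions for illustration, not code from our stack — the point is that every branch below is overhead a clean upstream makes unnecessary.

```python
import time

def verify_with_fallback(call_verifier, trace, retries=3, backoff_s=0.5):
    """Retry an unreliable upstream verifier; fail closed (HOLD) on exhaustion.

    `call_verifier` is an assumed callable that raises ConnectionError
    on upstream 503s.
    """
    for attempt in range(retries):
        try:
            return call_verifier(trace)
        except ConnectionError:
            time.sleep(backoff_s * (2 ** attempt))  # exponential backoff
    # Fail closed: the plan is held, and the audit trail must now record
    # an infrastructure failure instead of a verdict.
    return "HOLD"
```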
There's one more thing the cost number unlocks that deserves to be called out directly, because it's not visible in the table.
At $0.0006 per call (SERV-nano solo) or $0.018 per call (three-layer), PLV moves from "enterprise compliance checkpoint" into the same cost tier as a single agent reasoning step. That sounds like a small thing. It isn't.
Previously, PLV was cost-gated to high-value transactions — banking, MRM, regulated workflows. The economics didn't work for putting plan-level verification inside an agent loop, because at $0.06 per call you couldn't afford to verify every plan revision a multi-agent system produces. With SERV in the stack, that constraint goes away.
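The arithmetic behind that constraint is short. Using an assumed workload of 1,000 plan revisions per day for a busy multi-agent system (the volume figure is illustrative, not from the benchmark):

```python
revisions_per_day = 1_000  # assumed workload, for illustration only

prior_cost = revisions_per_day * 0.06    # prior cascade, per day
nano_cost  = revisions_per_day * 0.0006  # SERV-nano solo, per day

print(f"${prior_cost:.2f}/day vs ${nano_cost:.2f}/day")  # -> $60.00/day vs $0.60/day
```

At $60/day per agent system, plan-level verification is a budget line; at $0.60/day, it disappears into the noise of ordinary inference cost.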
This opens a category of use cases that was priced out before:

- Plan-level verification inside agent loops, where per-step verification had been too expensive to run at all.
- Verifying every plan revision a multi-agent system produces, rather than only the final plan before execution.
- High-frequency verification in long-running workflows, where call volume rather than per-call stakes dominates the cost.
This isn't "cheaper PLV for the same compliance use case." It's PLV becoming deployable as an agentic primitive — a category that was economically infeasible before SERV's cost structure showed up.
For anyone who wants to reproduce or scrutinize the setup:

- 120 cases drawn from production-shape workloads across finance, risk, healthcare, and code domains.
- Gold labels (ALLOW / BLOCK / HOLD) curated against ground truth.
- Faithfulness mode: the verifier evaluates whether the reasoning chain is logically coherent and evidence-supported, not just whether claimed evidence is present.
- Three configurations run against the same case set, same gold labels, in the same week: the prior Gemini → Sonnet 4.6 cascade, SERV-nano solo, and the three-layer architecture.
- Performance per dollar computed as accuracy ÷ cost per call, normalized to the prior cascade.
SERV Reasoning is in private beta. The numbers above are from that beta; we'll re-run the full benchmark when SERV reaches general availability and publish a follow-up if the numbers shift materially. We expect them to be stable — the architectural reasons SERV-nano is fast and reliable (purpose-built for verification, not general-purpose chat) don't go away at GA.
The three-layer cascade is now our production default for PLV calls served via verify.thoughtproof.ai. Existing customers using the previous configuration have been migrated transparently — same API contract, same verdict semantics, lower latency, lower cost, lower failure rate. If you're integrating PLV and want to test against the new pipeline, the API is the same; the request you'd have made last week is the request you'd make today.
If you're working on agentic verification — particularly in regulated domains, or in agent-loop deployments where per-step verification has been priced out — we'd like to hear what you're hitting. The category-expansion question (PLV inside agent loops) is the one we're most interested in working through with partners, and it's the one where the SERV economics matter most.
Credit and disclosure. SERV Reasoning is a product of OpenServ, currently in private beta. ThoughtProof is one of the early partners with access. The numbers in this post are drawn from a benchmark we ran on production-shape PLV workloads; OpenServ has authorized publication of the aggregate figures. We're grateful to the OpenServ team — particularly Daniel and Casper — for shipping a verification-grade reasoning model that makes the agentic-primitive use case economically viable in the first place.
Questions on the methodology, or interested in PLV integration? Email us at support@thoughtproof.ai or reach @Raulj1980 on X.