Why Step-Level Verification Breaks on Compressed Plans

April 18, 2026 · 7 min read · ThoughtProof v2

TL;DR: Step-level verification often underrates good agent plans because agents compress valid substeps into higher-level plan steps. The fix is not just better semantic matching. It is a structural one: segment-aware plan support. On a hard paraphrase-heavy benchmark slice, that moved coverage from 0.420 to 0.632 and produced materially better plan-level verdicts.

When people talk about agent verification, they usually imagine a simple question:

Did the agent do the right steps?

That sounds reasonable, until you try to verify real plans.

While building ThoughtProof v2, we ran into a failure mode that looks small at first, but turns out to matter a lot:

good agent plans often look incomplete when your verifier assumes every valid plan must be expressed at the same granularity as the reference trace.

That last phrase matters: same granularity.

A human annotator might write a solution as three explicit steps. An agent might compress the same logic into one higher-level plan step. If your verifier only checks one step against one step, the agent can get penalized even when the plan is actually defensible.

That is not just a scoring bug. It is a structural problem.

The failure mode

We built a small paraphrase-heavy benchmark slice to stress-test plan alignment.

At first, the symptoms looked familiar: reasonable agent steps were scoring as partial support, or as missing steps, against their reference counterparts.

The obvious reaction is to keep tuning the matcher: more aliases, looser semantic overlap, more weight on position. We tried parts of that.

It helped at the margins, but it did not solve the real issue.

In fact, one of the most useful intermediate results was negative: once we added a content gate, we saw that many candidate matches were actually just position-only pseudo-matches. They were not evidence of understanding. They were just a verifier being fooled by nearby ordering.
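To make the content gate concrete, here is a minimal sketch. The names (`content_gate`, `MIN_OVERLAP`) and the token-overlap scoring are illustrative assumptions, not ThoughtProof's actual API; the point is only that a match must carry content signal, not just positional proximity.

```python
import re

MIN_OVERLAP = 0.2  # assumed threshold: below this, a match is position-only

def tokens(step: str) -> set[str]:
    """Lowercase word tokens, punctuation stripped."""
    return set(re.findall(r"[a-z]+", step.lower()))

def content_gate(agent_step: str, ref_step: str) -> bool:
    """Reject candidate matches that share position but not content."""
    a, r = tokens(agent_step), tokens(ref_step)
    if not a or not r:
        return False
    jaccard = len(a & r) / len(a | r)  # overlap relative to combined vocabulary
    return jaccard >= MIN_OVERLAP

# A nearby-but-unrelated step fails the gate even when ordering lines up:
print(content_gate("fetch the user record", "retrieve the user record"))   # True
print(content_gate("fetch the user record", "send a confirmation email"))  # False
```

Anything that fails a gate like this is exactly the position-only pseudo-match described above: nearby in ordering, empty in content.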

That was clarifying. The bottleneck was not that our filter was too strict. The bottleneck was that our structure was too naive.

What was really happening

In several traces, the agent was not missing the reference reasoning. It was compressing it.

A single agent plan step sometimes faithfully summarized two or three annotator steps. Under one-to-one alignment, that looked like partial support or even a missing step. But from a plan perspective, the agent was often doing something entirely reasonable: expressing a valid subplan at a higher level.

This is where step-level verification starts to break.

If the verifier silently assumes that a good plan must unfold at the same textual granularity as the reference, then it will systematically underrate higher-level but still defensible plans.

The actual fix: segment-aware support

The breakthrough was surprisingly simple.

Instead of asking:

Which single annotator step matches this agent step?

we ask:

Does this agent step support a short contiguous span of annotator steps?

That small shift changes the game.

It lets the verifier recognize that one compressed plan step may legitimately cover a multi-step reference span, as long as the content signal is real.

This is what we call segment-aware plan support. Not bigger models. Not a giant ML system. Just a better structural primitive.
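A minimal sketch of that primitive, under assumed names and a toy token-overlap scorer (the production system presumably uses a stronger semantic signal): instead of scoring one agent step against one reference step, score it against every short contiguous span of reference steps and keep the best span that clears a content threshold.

```python
import re

def toks(s: str) -> set[str]:
    return set(re.findall(r"[a-z]+", s.lower()))

def jaccard(a: str, b: str) -> float:
    ta, tb = toks(a), toks(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def best_supported_span(agent_step, ref_steps, max_span=3, threshold=0.3):
    """Return (start, end, score) for the contiguous reference span this
    agent step supports best, or None if nothing clears the content gate."""
    best = None
    for start in range(len(ref_steps)):
        for end in range(start + 1, min(start + max_span, len(ref_steps)) + 1):
            score = jaccard(agent_step, " ".join(ref_steps[start:end]))
            if score >= threshold and (best is None or score > best[2]):
                best = (start, end, score)
    return best

# One compressed agent step legitimately covers all three reference steps:
refs = ["load the csv file",
        "drop rows with missing values",
        "compute the mean"]
compressed = "load the csv, drop missing rows, and compute the mean"
print(best_supported_span(compressed, refs))  # best span is (0, 3, ...)
```

Under one-to-one matching, the compressed step above would score weakly against each individual reference step; against the full span, the content signal is unmistakable.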

The empirical result

Once we added segment-aware support, the benchmark moved in a way that was too large to ignore.

0.420 single-step semantic coverage
0.632 segment-aware coverage
+0.212 coverage delta
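For readers who want the metric pinned down: coverage here can be read as the fraction of annotator steps covered by any accepted match. The match sets below are invented purely to show the formula; only the metric definition mirrors the numbers above.

```python
def coverage(covered_indices: set[int], n_ref_steps: int) -> float:
    """Fraction of annotator steps covered by at least one accepted match."""
    return len(covered_indices) / n_ref_steps

# Hypothetical 5-step reference trace. One-to-one matching only credits
# steps with a direct single-step match; segment-aware matching also
# credits steps rescued as part of a covered span.
one_to_one = {0, 4}            # invented for illustration
segment_aware = {0, 1, 2, 4}   # invented: steps 1-2 rescued by one span
print(coverage(one_to_one, 5), coverage(segment_aware, 5))  # 0.4 0.8
```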

That was the strongest signal from this whole v2 iteration.

It told us two things. First, this was not just a nice intuition. It was measurable. Second, the next layer of agent verification is probably not “better similarity scoring” in the abstract. It is better structural matching.

Why this matters beyond benchmarks

It would be easy to stop here and call this a benchmark improvement. That would miss the point.

Once segment-aware support exists, downstream interpretation changes too. You can start to distinguish cases that naive verification collapses into the same bucket: a reference step the agent genuinely skipped, a reference span the agent compressed into one defensible higher-level step, and a position-only pseudo-match with no real content behind it.

Those distinctions matter if you want trustworthy verdicts. They change whether the right answer is ALLOW, CONDITIONAL_ALLOW, HOLD, or BLOCK.

From diagnostics to policy

We pushed the segment-aware layer past pure diagnostics.

It now feeds into the plan-level verdicts themselves.

On the 5-trace paraphrase-heavy set, the first verdict distribution looks like this:

ALLOW = 2
CONDITIONAL_ALLOW = 1
HOLD = 2
BLOCK = 0

The interesting part is not only the counts. The two ALLOW cases were fully covered and segment-rescued. In other words, they are exactly the kind of plans that would look artificially weak under naive step-level verification.

At the same time, the hardest case stayed a HOLD, with substantial unresolved gaps. So this is not score inflation. It is better discrimination.
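One way to picture that discrimination is a small verdict policy over segment-aware coverage and unresolved gaps. The thresholds and the function below are assumptions for illustration; only the verdict labels come from this post.

```python
def verdict(coverage: float, unresolved_gaps: int) -> str:
    """Map segment-aware coverage and unresolved gaps to a plan verdict.
    Thresholds are illustrative, not ThoughtProof's actual policy."""
    if coverage >= 0.95 and unresolved_gaps == 0:
        return "ALLOW"
    if coverage >= 0.8 and unresolved_gaps <= 1:
        return "CONDITIONAL_ALLOW"
    if unresolved_gaps >= 2 or coverage >= 0.5:
        return "HOLD"
    return "BLOCK"

print(verdict(1.0, 0))   # a fully covered, segment-rescued plan
print(verdict(0.55, 3))  # substantial unresolved gaps stay a HOLD
```

The key property is the asymmetry: segment rescue can lift a compressed-but-complete plan to ALLOW, while genuinely unresolved gaps still hold the line.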

The bigger lesson

If agent verification stays at the level of isolated step checks, it will misread a lot of valid higher-level planning.

The missing layer is not only semantic similarity. It is plan structure.

You need systems that can ask:

Does this plan, at the level it is actually expressed, still cover the reference reasoning?

That is the bridge from basic trace checking to real plan-level verification.

Why this points toward ThoughtProof v2

ThoughtProof v1 is about whether a reasoning trace is defensible.

ThoughtProof v2 extends that into a stronger question:

Is the overall plan, dependency structure, and information flow defensible?

Segment-aware support is not the whole answer. But it is one of the first pieces that makes the shift real.

It gives the verifier a way to stop mistaking compressed plans for incomplete plans. And that feels like one of the central requirements for evaluating serious agent systems.

ThoughtProof v2 is moving from step checks toward plan-level verification.

If you want trustworthy agent systems, you cannot only verify plans at the resolution a benchmark annotator happened to write them down. You have to verify them at the level they are actually expressed.