The Harness Tax: A Boardroom Memo on Why the Wrapper Decides the Security Score

A benchmark published this week showed the same model producing a 26-point functional gap depending on the harness wrapping it. The risk register most engineering orgs carry still names the model and skips the harness, which is now the wrong unit of analysis.

TLDR

Endor Labs published an Agent Security League update this week showing the same model produces a 26-point swing in functional correctness depending on which harness wraps it. Most engineering risk registers still name the model and skip the harness, which is now the wrong unit of analysis. The next risk review needs to log the harness and a real verification rate, not a brand.

Headline the board saw

On April 27, Henrik Plate at Endor Labs posted the latest update to the Agent Security League. GPT-5.5, the model OpenAI shipped on April 23, now holds the top security score on the leaderboard. The headline most outlets ran was “GPT-5.5 wins on security.” Underneath that headline is the part a board has not seen yet.

GPT-5.5 inside Cursor scored 87.2 percent on functional correctness and 23.5 percent on security. GPT-5.5 inside Codex scored 61.5 percent functional and 20.1 percent security. Same model. Same week. Two harnesses. About 26 percentage points of functional difference and 3.4 points of security difference. That is not a model story. That is a harness story.

26 points: the functional-correctness gap between GPT-5.5 in Cursor and GPT-5.5 in Codex, same week, same model

What it actually means

I have been in enough risk reviews to know how this gets discussed. Someone says “we are standardizing on Claude” or “we are evaluating GPT-5.5 for the engineering org.” The conversation moves on. The harness, the thing actually executing the model’s output, the thing that decides which files get read, which commands get run, which credentials are in scope, gets treated like plumbing.

This week’s benchmark says the plumbing is the product.

"Cursor + GPT-5.5 sets a new high on security correctness at 23.5%."

Endor Labs Agent Security League update, April 27, 2026

That number sounds low because it is. The benchmark extends Carnegie Mellon’s SusVibes framework, covering 200 real-world tasks across 108 open-source projects and 77 CWE classes. Across the league, even the top performer leaves roughly 8 in 10 outputs with at least one security flaw. The point is not that any single harness has solved code security. The point is that swapping the harness around the same model moves the security and functional numbers more than a model upgrade does.

There is a verification-side story underneath this. A piece in Dev|Journal on April 26, drawing on Sonar’s State of Code survey, framed the gap clearly. 96 percent of developers do not fully trust AI-generated code, and only 48 percent always check it before committing. AI now produces 46 percent of new code. Verification has become a moderate or substantial bottleneck for 59 percent of teams. So the harness is shaping output quality, and the human review layer is not catching what the harness lets through.

That is what a security team is actually managing. Not “AI risk” in the abstract. Two specific control surfaces: the harness that produces the code, and the human who is supposed to verify it.

Key Insight

The two control surfaces that matter for AI code risk are the harness that produces the code and the verification step that follows it. A risk register that names only the model has skipped both.


Three questions the board will ask

When this lands in the next quarterly risk review, three questions will come up if anyone in the room has been reading recent benchmarks.

One. Does the risk register name the harness, or just the model? If governance docs say “we use Claude Opus 4.7” but do not specify whether that runs inside Claude Code, Cursor, Copilot, or a custom integration, the document is undercalibrated. The Endor benchmark this week is the proof. Log the harness, not just the model, as the controlled unit, and tag each surface with the harness version that is approved for production work.
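For the engineers who maintain the register, a minimal sketch of what "log the harness" could look like as a record. The field names, the surface names, and the version string are hypothetical placeholders, not drawn from any real register or from the Endor benchmark.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class RegisterEntry:
    surface: str                    # where the agent is used (hypothetical names below)
    model: str                      # what most registers already record
    harness: Optional[str]          # None means the register never recorded it
    harness_version: Optional[str]  # the version approved for production work


def undercalibrated(entries: List[RegisterEntry]) -> List[RegisterEntry]:
    """Return entries that name a model but are missing the harness or its version."""
    return [e for e in entries if not e.harness or not e.harness_version]


# Illustrative entries only; the version string is a placeholder.
entries = [
    RegisterEntry("payments-api", "GPT-5.5", "Cursor", "approved-version-placeholder"),
    RegisterEntry("internal-tools", "Claude Opus 4.7", None, None),
]

for e in undercalibrated(entries):
    print(f"{e.surface}: names {e.model} but not the approved harness and version")
```

A check like this can run against an exported register before each review, so the gap shows up as a list of surfaces instead of a discussion point.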

Two. What is the verification rate, and how is it measured? “We review AI code” is not a control. “47 percent of merged PRs that contain AI-generated code show evidence of human review touching the AI section, tracked monthly” is a control. Lenny Rachitsky ran a piece on April 27 describing a six-hour autonomous workflow that finished with “zero follow-up prompts, zero steering, and only one approval request.” That is the new normal for individual contributors. The board does not need to be frightened by that. It needs to see the measurement that sits alongside it.
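One way to compute a verification rate of that shape is sketched below. It assumes the team can already tag each merged PR with two facts: whether it contains AI-generated code, and whether a human review touched the AI-written section. How those tags are collected is left open, and the record type and sample data are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MergedPR:
    number: int
    has_ai_code: bool               # assumption: the team can tag this
    human_touched_ai_section: bool  # assumption: review evidence is recorded


def verification_rate(prs: List[MergedPR]) -> Optional[float]:
    """Percent of AI-containing merged PRs with human review of the AI section."""
    ai_prs = [p for p in prs if p.has_ai_code]
    if not ai_prs:
        return None  # no AI-containing PRs this period, so no rate to report
    reviewed = sum(p.human_touched_ai_section for p in ai_prs)
    return 100.0 * reviewed / len(ai_prs)


# Hypothetical month of merged PRs.
sample = [
    MergedPR(1, True, True),
    MergedPR(2, True, False),
    MergedPR(3, False, False),  # no AI code, excluded from the denominator
    MergedPR(4, True, True),
    MergedPR(5, True, False),
]

rate = verification_rate(sample)
print(f"Verification rate: {rate:.0f}% of AI-containing merged PRs")
```

The denominator is the design choice that matters: only PRs that contain AI-generated code count, so the metric cannot be inflated by well-reviewed human-written changes.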

Three. What is the swap cost on a harness change, and has the team ever practiced it? Being locked into one harness because nobody has run a switching exercise is not a portfolio strategy. It is a vendor relationship. Run a one-week pilot in an alternative harness, on a contained surface, and measure the security and functional delta on the team’s own code. The Endor benchmark gives a comparison number. The number from the team’s own codebase is the one auditors care about.
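The delta itself is simple arithmetic. The sketch below uses the published Endor figures for GPT-5.5 as the worked example; a pilot would substitute scores measured on the team's own code, and the function name is illustrative.

```python
from typing import Dict


def harness_delta(a: Dict[str, float], b: Dict[str, float]) -> Dict[str, float]:
    """Percentage-point difference (a minus b) for each metric in a."""
    return {metric: round(a[metric] - b[metric], 1) for metric in a}


# Scores reported in the Endor Labs update for the same model, GPT-5.5.
cursor = {"functional": 87.2, "security": 23.5}
codex = {"functional": 61.5, "security": 20.1}

delta = harness_delta(cursor, codex)
print(delta)  # {'functional': 25.7, 'security': 3.4}
```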

The model is no longer the unit of risk. The harness is. A risk register that names only the model has stopped describing the system it claims to govern.


60-second brief for the executive readout

The model is not the unit of risk. The harness is. This week’s Endor Labs Agent Security League update showed GPT-5.5 producing meaningfully different functional and security results in Cursor versus Codex. Add the broader verification gap, where 96 percent of engineers do not fully trust AI code and only 48 percent always check it, and two control surfaces need to show up in the next risk doc: the harness, and the verification rate.

Three concrete moves. Update the risk register to log harness, not just model. Define a verification metric the org can report monthly. Run a harness-swap pilot on a contained surface to produce a number for the team’s own code. None of these require new tooling and none of them require a vendor conversation.


What to watch over the next two weeks

Watch for harness-specific security bulletins, not just model updates, because that is where the meaningful variance lives. Watch how engineers describe their work in standups. If “I gave it the prompt and let it run” is showing up more often, the verification metric is the thing the board will ask about next quarter. And watch the next Endor update. If a harness moves 5 points in either direction, that is the prompt to recheck whether the risk register still describes reality.
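The 5-point trigger can be written down as a check so it does not depend on someone remembering to look. In this sketch the previous scores are the figures from this week's Endor update; the next-update scores are invented purely to show the check firing and are not real results.

```python
from typing import Dict, List

RECHECK_THRESHOLD = 5.0  # points in either direction, per the rule of thumb above

Scores = Dict[str, Dict[str, float]]


def harnesses_to_recheck(previous: Scores, current: Scores,
                         threshold: float = RECHECK_THRESHOLD) -> List[str]:
    """Return harnesses whose score on any metric moved by at least the threshold."""
    flagged = []
    for harness, prev in previous.items():
        cur = current.get(harness)
        if cur is None:
            continue  # harness absent from the new update, nothing to compare
        if any(abs(cur[m] - prev[m]) >= threshold for m in prev):
            flagged.append(harness)
    return flagged


previous = {
    "Cursor + GPT-5.5": {"functional": 87.2, "security": 23.5},
    "Codex + GPT-5.5": {"functional": 61.5, "security": 20.1},
}
# Hypothetical next update, invented for illustration only.
next_update = {
    "Cursor + GPT-5.5": {"functional": 86.0, "security": 24.0},
    "Codex + GPT-5.5": {"functional": 68.0, "security": 20.5},
}

print(harnesses_to_recheck(previous, next_update))
```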

The good news is that none of this requires a new vendor, a new platform, or a new committee. It requires naming what the team is actually using, measuring what is actually being verified, and writing both into the document the board reads. That is the kind of work a calm engineering org can finish before the next quarterly review.

Sources

  1. GPT-5.5 Sets a New Code Security Record with Cursor, not Codex, in Agent Security League - Endor Labs, 2026-04-27
  2. This week on How I AI: GPT 5.5, Claude Design, and GPT Images 2.0 hands-on reviews - Lenny's Newsletter, 2026-04-27
  3. Vibe Coding Audit Failure: 96% of Developers Distrust AI-Generated Code - Dev|Journal, 2026-04-26
  4. Governing Claude Code: Mitigating Risks of Autonomous Enterprise Production Deployments - Dev|Journal, 2026-04-25
