The Harness Tax: A Boardroom Memo on Why the Wrapper Decides the Security Score

A benchmark published this week showed the same model producing a 26-point functional gap depending on the harness wrapping it. The risk register most engineering orgs carry still names the model and skips the harness, which is now the wrong unit of analysis.

TLDR

Endor Labs published an Agent Security League update this week showing that the same model produces a 26-point swing in functional correctness depending on which harness wraps it. Most engineering risk registers still name the model and skip the harness, which is now the wrong unit of analysis. The next risk review needs to log the harness and a real verification rate, not a brand.

The headline the board saw

Last weekend Henrik Plate at Endor Labs posted the latest update to the Agent Security League. GPT-5.5, the model OpenAI shipped on April 23, now holds the top security score on the leaderboard. The headline most outlets ran was "GPT-5.5 wins on security."

Underneath that headline is the part a board has not seen yet. GPT-5.5 inside Cursor scored 87.2 percent on functional correctness and 23.5 percent on security. GPT-5.5 inside Codex scored 61.5 percent functional and 20.1 percent security. Same model. Same week. Two harnesses. About 26 percentage points of functional difference, and a meaningfully different security profile. That is not a model story. That is a harness story.

26 points of functional-correctness gap between GPT-5.5 in Cursor and GPT-5.5 in Codex, same week, same model.

What it actually means

I have been in enough risk reviews to know how this gets discussed. Someone says "we are standardizing on Claude" or "we are evaluating GPT-5.5 for the engineering org." The conversation moves on. The harness, the thing actually executing the model's output, the thing that decides which files get read, which commands get run, which credentials are in scope, gets treated like plumbing. This week's benchmark says the plumbing is the product.

"Cursor + GPT-5.5 sets a new high on security correctness at 23.5%."
Endor Labs Agent Security League update, April 27, 2026

That number sounds low because it is. The benchmark extends Carnegie Mellon's SusVibes framework, covering 200 real-world tasks across 108 open-source projects and 77 CWE classes. Across the league, even the t
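To make the thesis concrete: a risk register keyed only by model name cannot even represent the gap the benchmark found. A minimal sketch of the alternative, keying entries by the (model, harness) pair, follows; the data structure and function names are hypothetical, and the scores are the Cursor and Codex numbers quoted above.

```python
# Hypothetical risk-register entries keyed by (model, harness), not model alone.
# Scores are the functional-correctness / security percentages quoted above.
entries = {
    ("GPT-5.5", "Cursor"): {"functional": 87.2, "security": 23.5},
    ("GPT-5.5", "Codex"):  {"functional": 61.5, "security": 20.1},
}

def functional_gap(model: str) -> float:
    """Spread in functional correctness across harnesses for one model.

    A register keyed only by model would collapse these rows into one
    and this number would be invisible.
    """
    scores = [e["functional"] for (m, _), e in entries.items() if m == model]
    return max(scores) - min(scores)

gap = functional_gap("GPT-5.5")
print(f"Harness-driven gap for GPT-5.5: {gap:.1f} points")  # about 26 points
```

The point of the sketch is the key, not the arithmetic: once the harness is part of the identifier, each (model, harness) row can carry its own verification rate instead of inheriting a brand-level one.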
