How long does a coding-agent harness evaluation stay valid?

In 72 hours this week, Cursor shipped its own model and Google relaunched Antigravity as a platform. Here is how to build a coding-agent harness evaluation that survives the release cadence instead of expiring with it.
In one 72-hour window this week, Cursor shipped its own coding model and Google relaunched Antigravity as a full agent platform. Any harness shortlist built more than a few weeks ago now measures products that have already changed. The fix is not another bakeoff. It is a standing evaluation: a fixed task set, a rubric that scores harness properties instead of benchmark numbers, a named owner, and a clear re-trigger.
The problem this solves
I talked to an engineering leader last week who ran a clean coding-agent bakeoff in early April. Three harnesses, a scoring sheet, a decision memo, the whole thing. Good work. The memo is now six weeks old, and it describes products that no longer exist.
Here is what happened in the three days before this article. On May 18, Cursor shipped Composer 2.5, its own in-house coding model, which The Decoder covered as matching Opus 4.7 and GPT-5.5 on headline benchmarks. On May 19, at Google I/O 2026, Google relaunched Antigravity as a standalone agent-first platform with a CLI, an SDK, managed execution, and enterprise support, per Google’s own developer keynote blog and coverage in SiliconANGLE.
So the April memo has two stale rows. The Cursor row scored a model that no longer runs, and the shortlist never included Antigravity, because in April it was a minor IDE feature rather than a platform. The evaluation was run once and filed, the way a procurement decision gets filed. The thing it evaluated has the half-life of a software release.
The approach
The instinct after a week like this is to run another bakeoff. Resist it. A bakeoff you run once is guaranteed to expire on the same schedule as the last one. What survives the release cadence is a standing evaluation, something cheap enough to re-run on demand. Here is how to build one.
- Freeze a task set, not a benchmark
  Pull eight to twelve real tasks from your own closed pull requests and backlog. A failing test to fix, a small feature, a gnarly refactor, a dependency bump. These become the fixed exam, and every harness sits the same exam every time. Public benchmarks like SWE-bench drift and get gamed. A slice of your own backlog does not.
- Score the harness, not the model
  Build a short rubric of durable properties: the permission and sandbox model, the CLI and SDK surface, observability and audit trail, admin and governance controls, and whether there is a real kill switch. The model inside is a swappable, vendor-controlled variable. Score the box, not its current contents.
- Name one owner
  One person owns the rubric and owns the re-run. Not a committee, not a rotating tiger team. The evaluation is now a maintained instrument, and instruments need a named keeper or they quietly rot in a wiki.
- Define the re-trigger
  The evaluation re-runs on an event, not a calendar date. A model swapped under a harness already in use. A major version release. A new entrant clearing the capability bar. Antigravity 2.0 is exactly that kind of trigger, and it should have fired a re-run this week.
- Keep it cheap to re-run
  The whole pass should cost a focused day, not a quarter. If a re-run is expensive, it will quietly never happen, and the field guarantees it needs to happen often. Cheap and repeatable beats thorough and abandoned.
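The pieces above fit in a small spec that can live in version control next to the code. What follows is a minimal Python sketch, not a prescribed tool: every name in it (`Task`, `StandingEvaluation`, `Trigger`, the PR identifiers, the owner) is hypothetical. It fingerprints the task set so that an accidental change to the exam is detectable, and it re-runs only on the trigger events described above. Only two tasks are shown for brevity.

```python
from dataclasses import dataclass
from enum import Enum
import hashlib
import json


class Trigger(Enum):
    # The three re-trigger events named in the text.
    MODEL_SWAPPED = "model swapped under a harness in use"
    MAJOR_RELEASE = "major version release"
    NEW_ENTRANT = "new entrant clears the capability bar"


@dataclass(frozen=True)
class Task:
    task_id: str      # hypothetical identifier, e.g. a closed PR reference
    kind: str         # "failing-test", "feature", "refactor", "dependency-bump"
    description: str


@dataclass(frozen=True)
class StandingEvaluation:
    owner: str                 # one named person, not a committee
    tasks: tuple[Task, ...]    # the frozen exam

    def fingerprint(self) -> str:
        """Hash of the task set; changes if a task is edited, added, removed or reordered."""
        payload = json.dumps(
            [[t.task_id, t.kind, t.description] for t in self.tasks]
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def should_rerun(self, events: list[Trigger]) -> bool:
        """Re-run on any trigger event, never on a calendar date."""
        return len(events) > 0


# Illustrative usage with placeholder values.
evaluation = StandingEvaluation(
    owner="alex",
    tasks=(
        Task("PR-101", "failing-test", "Fix flaky retry test"),
        Task("PR-102", "dependency-bump", "Bump HTTP client one major version"),
    ),
)
baseline_fingerprint = evaluation.fingerprint()
print(evaluation.should_rerun([Trigger.NEW_ENTRANT]))
```

Storing the fingerprint alongside each run's results makes it easy to tell whether two runs sat the same exam before comparing them.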
Why most teams get this wrong
The mistake that looks smartest is anchoring the whole evaluation on benchmark numbers. SWE-bench scores. Terminal-Bench results. The vendor line that a new model “matches Opus 4.7.” Numbers feel rigorous, so they end up as the headline of the decision memo.
They are also the most perishable thing in the entire evaluation. Composer 2.5’s benchmark figures are vendor-reported, and they describe a model that is one release away from changing again. Anchoring a twelve-month tooling commitment on a number with a two-month shelf life is how teams end up surprised.
Here is the tell that the ground shifted. The harness-versus-model split is now the vendors’ own vocabulary. Google’s keynote post describes its new SDK as giving programmatic control over the Antigravity agent harness, and its Managed Agents as delivering that same harness through the Gemini API. When the vendor itself tells you the model is a swappable component, scoring the model is scoring the wrong unit.
The durable signal is not how well a harness scored last week. It is how much of its behavior an engineering org can see, gate, and reverse. Sandbox model, permission surface, audit trail, kill switch. Those properties survive a model swap. A benchmark score does not.
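A rubric along these lines can be a few lines of code rather than a spreadsheet. The sketch below is one possible shape, under stated assumptions: the 0 to 3 scale, the `minimum` threshold, the class name `HarnessScore`, and the example harness name and scores are all invented for illustration. Note what is deliberately absent: there is no field for the model or its benchmark score.

```python
from dataclasses import dataclass

# Durable properties named in the text. The 0-3 scale is an assumption.
DURABLE_PROPERTIES = (
    "sandbox_model",
    "permission_surface",
    "audit_trail",
    "kill_switch",
)


@dataclass(frozen=True)
class HarnessScore:
    harness: str
    scores: dict[str, int]  # property -> 0 (absent) .. 3 (strong)

    def __post_init__(self) -> None:
        # Refuse partially filled or out-of-range rubrics.
        missing = set(DURABLE_PROPERTIES) - set(self.scores)
        if missing:
            raise ValueError(f"unscored properties: {sorted(missing)}")
        for name, value in self.scores.items():
            if not 0 <= value <= 3:
                raise ValueError(f"{name} must be between 0 and 3, got {value}")

    def total(self) -> int:
        return sum(self.scores[p] for p in DURABLE_PROPERTIES)

    def gaps(self, minimum: int = 2) -> list[str]:
        """Properties scoring below the (assumed) acceptable minimum."""
        return [p for p in DURABLE_PROPERTIES if self.scores[p] < minimum]


# Illustrative usage with made-up scores for a hypothetical harness.
score = HarnessScore(
    "harness-a",
    {"sandbox_model": 3, "permission_surface": 2, "audit_trail": 1, "kill_switch": 3},
)
print(score.total(), score.gaps())
```

The `gaps` list is arguably more useful than the total: it names the properties an engineering org cannot yet see, gate, or reverse.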
The numbers
The figures worth holding from this week, all carried as vendor claims rather than independent findings: Composer 2.5 reports 79.8 percent on SWE-bench Multilingual, with standard-tier pricing at $0.50 per million input tokens and $2.50 per million output tokens. Cursor says it trained the model on twenty-five times more synthetic tasks than its predecessor.
"85 percent of the compute budget went toward extra training and reinforcement learning."
Notice what every one of those numbers has in common. They describe the model, they come from the vendor, and they will all be different by the next release. That is precisely why they belong in the re-score-every-run column, never in the anchor-the-decision column.
| Transient signals (re-score, never anchor) | Durable signals (anchor here) |
|---|---|
| Benchmark scores | Permission and sandbox model |
| Vendor pricing tier | Observability and audit trail |
| Current model version | Governance and admin controls |
| "Matches Opus 4.7" claims | Exit cost and kill switch |
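Because pricing sits in the transient column, it is worth recomputing on every run rather than quoting from an old memo. As a worked example, the sketch below uses the vendor-quoted Composer 2.5 standard-tier prices cited above. The token counts are purely hypothetical placeholders for one pass over a ten-task set, and the function name is invented for illustration.

```python
# Vendor-quoted standard-tier prices from the text, in USD per million tokens.
INPUT_PRICE_PER_M = 0.50
OUTPUT_PRICE_PER_M = 2.50


def run_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimated model cost of one evaluation pass at the quoted prices."""
    input_cost = input_tokens * INPUT_PRICE_PER_M / 1_000_000
    output_cost = output_tokens * OUTPUT_PRICE_PER_M / 1_000_000
    return input_cost + output_cost


# Hypothetical usage for one pass: 4M input tokens, 600k output tokens.
estimate = run_cost_usd(4_000_000, 600_000)
print(f"${estimate:.2f}")  # 2.00 + 1.50 = $3.50
```

When the vendor changes the price sheet, only the two constants change, which is the point of treating the figure as a per-run input and not as an anchor.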
Ship it
Monday morning, the move is small. Pull ten tasks from a recent sprint. Write the short harness rubric on durable properties. Name the owner. Run it once against whatever harnesses are in use today, and that first pass becomes the baseline.
A bakeoff tells you which harness won last month. A standing evaluation tells you whether the one running today still deserves the seat.
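One way to read "still deserves the seat" is as a comparison against the baseline pass. The sketch below assumes durable properties are scored as integers per property. The function name, the property keys, the scores, and the "no durable property may regress" rule are all illustrative assumptions, not a rule taken from any vendor or from the sources cited here.

```python
def still_deserves_seat(
    baseline: dict[str, int], today: dict[str, int]
) -> tuple[bool, list[str]]:
    """Compare today's durable-property scores against the baseline pass.

    Returns (keeps_seat, regressions). A property missing from today's
    scores is treated as 0, so it counts as a regression.
    """
    regressions = [
        prop for prop, base in baseline.items()
        if today.get(prop, 0) < base
    ]
    return (not regressions, regressions)


# Illustrative usage with made-up scores.
baseline_scores = {"sandbox_model": 3, "audit_trail": 2, "kill_switch": 3}
todays_scores = {"sandbox_model": 3, "audit_trail": 1, "kill_switch": 3}
keeps_seat, regressions = still_deserves_seat(baseline_scores, todays_scores)
print(keeps_seat, regressions)
```

A result like this does not pick a new winner. It only says whether the harness in use has slipped on a property the team decided to anchor on.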
The point was never to re-pick a winner every week. That would be exhausting, and frankly nobody has the time. The point is to never again be caught explaining to a board why the tool named in the April memo is not the tool the team actually runs.
A week where two vendors reshaped the field in seventy-two hours is not a crisis. It only feels like one when the evaluation is an event instead of a habit. Make it cheap, give it an owner, and a week like this stops being a scramble. It becomes a routine one-day re-run, and then it becomes boring. Boring, for an engineering leader in 2026, is a quietly wonderful place to be.
Sources
- Cursor's Composer 2.5 matches Opus 4.7 and GPT-5.5 benchmarks at a fraction of the cost - The Decoder, 2026-05-18
- All the news from the Google I/O 2026 Developer keynote - Google Developers Blog, 2026-05-19
- Google Launches Antigravity 2.0 at I/O 2026: A Standalone Agent-First Platform with CLI, SDK, Managed Execution, and Enterprise Support - MarkTechPost, 2026-05-19
- With expanded Antigravity platform, Google accelerates agent-native software development - SiliconANGLE, 2026-05-19