Four Questions to Answer Before Locking a Team Into One Coding Harness for Twelve Months

2026-04-19

Anthropic's Opus 4.7 release and GitHub's cross-vendor skills spec landed the same day, which changes what a twelve-month harness commitment actually commits to. Four questions for engineering leaders to work through before the default becomes the decision.

TLDR

Anthropic shipped Opus 4.7 and GitHub shipped a cross-vendor agent-skills spec inside the same 48 hours. If a twelve-month harness commitment is sitting on the desk, the question is no longer which tool has the best model. It is where the team's instructions, review pipeline, and token budget live when the underlying model shifts again in July.

Problem this solves

A VP of Engineering messaged me this week asking whether to standardize her 120-engineer org on Claude Code, Cursor, or GitHub Copilot’s agent mode for the next fiscal year. Her CFO wants one line item. Her staff engineers each have a different favorite. She has four weeks to decide.

This is the wrong question asked with the right urgency. The right question is not which harness to pick. It is which four sub-decisions the harness choice is actually made of, so the argument can move to the one that matters.

The approach

April 16 was a busy day. Anthropic released Claude Opus 4.7. GitHub rolled it out to Copilot Pro+, Business, and Enterprise the same morning. Anthropic added a new /ultrareview command inside Claude Code. Cursor posted CursorBench numbers. OpenAI shipped a major Codex update that lets it drive a Mac desktop. And GitHub quietly released a spec for installing agent skills that works across Copilot, Claude Code, Cursor, Codex, and Gemini CLI.

That last one matters more than the others. I will come back to it.

Here are the four questions I would walk through with staff engineers before committing.

Question 1: Where does the engineer actually spend the thinking hour?

Not the typing hour. The thinking hour. The one where they are reading code, staring at tests, debating an approach with the agent.

If that hour happens in the terminal, the right primitive is a CLI agent: Claude Code, Codex CLI, Aider. If it happens in an editor, an IDE-native harness fits: Cursor, Windsurf, Copilot agent mode. If it happens in PR review, an async reviewer does more work: Copilot code review, CodeRabbit, the /ultrareview pattern inside Claude Code.

Most orgs have all three in the stack. The disciplined ones pick one primary per engineer tier and let the other two stay tactical.

Question 2: Is the instruction layer portable?

This is the question GitHub just changed the answer to. On April 16 they shipped gh skill, a command for managing agent skills across the whole landscape. The spec works in Copilot, Claude Code, Cursor, Codex, and Gemini CLI. One format, five hosts, version-pinned and content-addressed.

The JLS42 newsletter called it “a common infrastructure that could become the npm install of the agentic world.” That is the right frame. A team’s accumulated prompt work, house rules, redacted-data handling, and deploy runbooks for agents are the asset at stake. If those live inside one vendor’s walled garden, a twelve-month lock-in is really a twenty-four-month one.

Pick a harness whose instructions travel.

Question 3: Who verifies the agent’s work, and where does that verification live?

Opus 4.7 ships with self-verification baked in. Verdent’s Rui Dai wrote up an evaluation on April 17 with a line I kept rereading: “The model writes tests, runs them, and fixes failures before surfacing results to the orchestrator.”

That sentence redraws the review pipeline.

If the harness runs verification inside the session, CI becomes a second net instead of the first one. Code review shifts from “does this work” to “does this belong.” The /ultrareview command Anthropic added is a preview of what that looks like: a dedicated review session that flags bugs and design issues before a human ever opens the PR.

Great feature. Also a lock-in moment. If the review quality is coupled to the harness, switching next year is not a tool migration. It is a process migration.

Question 4: What is the per-engineer cost curve when the underlying model upgrades mid-year?

This is the one almost no one models out.

Rui Dai’s writeup noted that “The new tokenizer produces roughly 1.0 to 1.35x more tokens for the same input text.” GitHub’s changelog said Opus 4.7 ships with “a 7.5x premium request multiplier as part of promotional pricing until April 30th.” Hex told Dai that “low-effort Opus 4.7 is roughly equivalent to medium-effort Opus 4.6.”

Read those three lines together. The price per token held flat. The tokens per request went up. The effort dial shifted, so a given setting no longer means what it did under Opus 4.6. And the multiplier on the base price moves depending on which harness hosts the request.

Per-engineer monthly spend is a moving target even when no tool ever changes.
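To make the drift concrete, here is a minimal sketch of how the tokenizer range alone moves one engineer's monthly token spend. The $500 baseline is a hypothetical figure chosen for illustration and the function name is mine; only the 1.0 to 1.35x range comes from the Verdent writeup. It assumes spend scales linearly with token count, meaning the per-token price and the engineer's usage both stay fixed.

```python
def spend_range(baseline_usd, low_factor=1.0, high_factor=1.35):
    """Return (low, high) monthly spend after a tokenizer change.

    Assumes spend scales linearly with token count: the price per token
    and the engineer's usage pattern are both unchanged.
    """
    return (baseline_usd * low_factor, baseline_usd * high_factor)


# Hypothetical baseline: an engineer whose token bill was $500/month before.
low, high = spend_range(500.0)
print(f"${low:,.2f} to ${high:,.2f} per engineer per month")
```

Swap in a real baseline from last quarter's invoice; the spread between the two ends is the part of the bill no one on the team chose.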

Up to 1.35x more tokens per request under Opus 4.7's new tokenizer, before any usage change

Why most teams get this wrong

Two mistakes I see repeatedly.

First, teams run the bakeoff on features instead of on the four questions above. They pick five tasks, time them across three tools, and declare a winner. That tells them which tool was best at those five tasks last month. It tells them nothing about what happens when Opus 4.8 ships in July and the tokenizer shifts again.

Second, teams treat the harness decision as a tool procurement when it is really a workflow commitment. Every harness embeds opinions about where instructions live, who reviews, and what gets automated. Those opinions become the team’s opinions within a quarter.

Pick the workflow first. Pick the harness that hosts that workflow with the least fight second.

"SWE-bench Verified jumped from 80.8% to 87.6%, CursorBench climbed from 58% to 70%."

Verdent Guides, Rui Dai, April 17 2026

Those are real improvements. They are also not the number that predicts renewal at month twelve. The number that predicts renewal is whether an engineer can move a skill from one harness to the next in an afternoon instead of a quarter.

The numbers

A 120-engineer org on a mid-tier harness at roughly $100 per seat per month is a $144,000 annual commitment. The token bill on top of that sits somewhere between $200 and $2,000 per engineer per month depending on usage pattern. At the low end, that is another $288,000 annually. At the high end, it is $2.88 million.

The gap between those two numbers is determined by three things. Which harness got picked. How it defaults the effort dial. And how the underlying model’s tokenizer treats the codebase.

None of the three show up on the pricing page.
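The arithmetic behind those figures can be sketched in a few lines, which also makes it easy to rerun for a different headcount. The inputs are the ones stated above (120 engineers, roughly $100 per seat, $200 to $2,000 in tokens per engineer per month); the constant and function names are mine.

```python
ENGINEERS = 120
MONTHS = 12
SEAT_USD_PER_MONTH = 100             # "roughly $100 per seat per month"
TOKEN_USD_PER_MONTH = (200, 2_000)   # low and high usage patterns


def annual(per_engineer_monthly_usd):
    """Annualize a per-engineer monthly figure across the whole org."""
    return ENGINEERS * MONTHS * per_engineer_monthly_usd


seats = annual(SEAT_USD_PER_MONTH)
tokens_low, tokens_high = (annual(usd) for usd in TOKEN_USD_PER_MONTH)

print(f"seats:       ${seats:,}")
print(f"tokens low:  ${tokens_low:,}")
print(f"tokens high: ${tokens_high:,}")
print(f"total range: ${seats + tokens_low:,} to ${seats + tokens_high:,}")
```

The seat line is the only one a pricing page fixes in advance; the token lines are the ones the three factors above move.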

Key Insight

The twelve-month harness decision is not a tool choice. It is a decision about where instructions live, who runs verification, and how much tokenizer drift the budget can absorb before renewal. Pick for those three dimensions, not for this month's benchmark.

Ship it

Here is the four-week decision process I would run for a mid-sized engineering org.

Week 1: Inventory the instruction layer

List every prompt template, house rule, and deploy runbook the team has accumulated. Fewer than ten means the team is already locked in by default. More than fifty means the portability question is a real one. Either way, map them to the GitHub skills spec format to see what survives a harness migration.

Week 2: Run three candidate harnesses against one real task per engineer tier

A staff engineer doing a hard refactor. A senior doing a feature. A junior doing a small bug. Same task, same codebase, three harnesses. Measure cycle time, review effort, and tokens burned. No winner declared yet.

Week 3: Model the twelve-month cost curve under two scenarios

Scenario A: the underlying model gets replaced mid-year with a new tokenizer. Scenario B: the harness vendor raises the premium multiplier after promotional pricing ends. If either scenario breaks the budget, the harness choice has to absorb that risk instead of the CFO.

Week 4: Commit to the workflow, pilot the harness for six months

Write the workflow first. Where instructions live. Who reviews. What gets automated. Then commit to a harness for six months, not twelve. Six months clears the next model upgrade cycle with real data instead of vendor marketing.
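The Week 3 exercise fits in a short script. What follows is a rough sketch, not a pricing model: the $500 per-engineer baseline, the $900,000 budget, the 1.5x price increase, and the months affected are all hypothetical placeholders. Only the 1.35x tokenizer upper bound and the 120-engineer headcount come from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    name: str
    token_factor: float   # tokens per request relative to today
    price_factor: float   # price per request relative to today
    months_affected: int  # how many of the twelve months the change applies to


def annual_token_spend(engineers: int, monthly_usd: float, s: Scenario) -> float:
    """Annual token spend when the scenario applies for part of the year."""
    unaffected = 12 - s.months_affected
    affected = s.months_affected * s.token_factor * s.price_factor
    return engineers * monthly_usd * (unaffected + affected)


# All inputs below are hypothetical except the 1.35x tokenizer upper bound
# and the 120-engineer headcount.
ENGINEERS = 120
MONTHLY_USD = 500.0     # assumed per-engineer token spend today
BUDGET_USD = 900_000.0  # assumed annual token budget

scenarios = [
    Scenario("baseline", 1.0, 1.0, 0),
    Scenario("A: new tokenizer at mid-year", 1.35, 1.0, 6),
    Scenario("B: multiplier raised after promo", 1.0, 1.5, 8),
]

results = {s.name: annual_token_spend(ENGINEERS, MONTHLY_USD, s) for s in scenarios}

for name, spend in results.items():
    verdict = "within budget" if spend <= BUDGET_USD else "breaks budget"
    print(f"{name}: ${spend:,.0f} ({verdict})")
```

With these assumed inputs, scenario A stays under the budget and scenario B does not. The point of the exercise is to find out which of the two the real numbers are closer to before signing.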

The teams that end 2026 looking smart will not be the ones who picked the best harness in April. They will be the ones who picked the most portable one.

Sources

Introducing Claude Opus 4.7 - Anthropic, 2026-04-16
Claude Opus 4.7: What Changed for Coding Agents (April 2026) - Verdent Guides, 2026-04-17
Claude Opus 4.7 available, Codex moves to computer use on macOS, OpenAI launches GPT-Rosalind - JLS42, 2026-04-16
Claude Opus 4.7 is generally available - GitHub Changelog, 2026-04-16
