Opus 4.7, four days in: the lift is real and the rate card lies

2026-04-20

Four days into Opus 4.7, the productivity gains show up in measured team data. The headline price is unchanged but the new tokenizer silently expands code by up to 35%. Here is what engineering leaders should actually measure this week.

TLDR

Four days into Opus 4.7's rollout, the productivity gains are showing up in measured team data. Notion logged a 14% improvement on multi-step workflows and a 66% drop in tool-calling errors on their internal benchmark. The headline API price is unchanged from 4.6, but the new tokenizer silently expands code and JSON by up to 35%. CFOs will see the variance before engineering leaders report the wins.

The setup

Four days ago Anthropic shipped Opus 4.7 into general availability. Every major harness plugged it in within hours. Claude Code, Cursor, GitHub Copilot, Codex, a handful of smaller ones. By nightfall on April 16 they were all running the same model weights.

I spent the weekend reading what actually happened once the confetti settled. The answer turned out to be more interesting than the launch post and more useful than the Twitter threads.

Two things are true at the same time. The productivity lift is measurable, not marketing. And the rate card did not move, but the bill did.

If an engineering org just signed a Claude contract last quarter, both of those facts matter this week.

What they tried

The most credible early measurement comes from Notion. On April 17 Karolina Zieminski published their internal benchmark results alongside a working review of the model.

"Notion is reporting a 14% improvement in multi-step workflows and a 66% drop in tool-calling errors on their internal benchmark."

Karo Zieminski Substack, April 2026

Those are the two numbers that actually move delivered work. Multi-step success is how long an agent can run before a human steps back in. Tool-calling errors are where agents burn context and tokens recovering from a fumble. When both improve together, you get longer autonomous runs that cost less per successful outcome, which is the unit economics that keep agentic workflows alive past a pilot.

The lift is real and the rate card lies. Both of those show up in the same week.

On April 18 the PM Hamza Farooq posted his first real working session with 4.7 on BoringBot. His line stuck with me: “4.7 catches its own reasoning errors mid-output. 4.6 doesn’t. For multi-step workflows, that matters more than surface polish.” That is the self-verification behavior Anthropic advertised in the release post, now showing up in somebody’s Tuesday morning prompt.

On the harness side, the vendor numbers line up. Cursor ran 4.7 against 93 real software engineering tasks on their internal bench and solved 70%, including four tasks that neither 4.6 nor Sonnet 4.6 could crack. CodeRabbit’s code review harness improved recall by more than 10% on hard bugs inside complex PRs, with precision stable. Rakuten reported 4.7 resolves three times as many production tasks as 4.6 in their internal tests.

The thread connecting these numbers is not speed. It is that the model finishes work it used to drop. That is a real change in the shape of the senior engineer’s workday. It is also why the rest of this post matters.

Where it broke

Here is where it gets quietly expensive. Tokencost.app published an analysis on April 17 with the line most execs missed. The new Opus 4.7 tokenizer can produce up to 35% more tokens for the same code or JSON input. Erik van Klinken at Techzine independently confirmed the same multiplier on the same day, calling out “an updated tokenizer that can map the same input to roughly 1.0 to 1.35 times more tokens.”

1.0 to 1.35x

tokenizer expansion on identical input, independently reported by Tokencost.app and Techzine on April 17

The API rate card is unchanged. $5 per million input tokens, $25 per million output tokens. And yet a request that cost ten cents on 4.6 can cost thirteen and a half cents on 4.7 without anyone at the company touching a model picker. Code-heavy workloads feel the full effect. English prose stays roughly flat.

Layered on top of the cost question, first-48-hour reception was mixed enough that a Business Insider aggregation republished on DNYUZ captured both camps in the same paragraph. One quote: “Opus 4.7 is burning through tokens like nobody’s business, but it’s gooooooooood.” Another, from someone who felt burned: “Claude Opus 4.7 is a serious regression, not an upgrade.” Anthropic pushed patches inside 24 hours. Some regressions went away. Some stayed.

There is an escape hatch, and it matters. Tokencost noted that batch mode combined with prompt caching pulls the effective input cost down to roughly $0.25 per million tokens. Their phrasing: “At that level, the tokenizer overhead is nearly irrelevant.” That is a real lever. It is also a lever most engineering teams have not pulled yet, because it requires orchestration work and nobody had a forcing function when tokens were cheaper per request.

The pattern

Here is the pattern I keep watching across model transitions. Benchmark lift does not equal delivered productivity. The gap between the two is the work your team does to capture the lift. Right now, that work is invisible to most dashboards.

Key Insight

The productivity question this week is not "is 4.7 better." It is "is the organization set up to capture the lift before the tokenizer tax eats it."

Three things are worth measuring before the April invoice lands:

Tokens per successful merge, not tokens per request. A 14% lift in multi-step completion that costs 30% more per request is a drag on throughput economics. The same 14% lift that costs the same per delivered merge is a pure win. Finance cannot do this math alone. Your VP of Engineering has to run it and bring the number to the CFO before the renewal conversation starts.

Prompts written for 4.6, behaving differently on 4.7. Anthropic has been explicit that prompts tuned for earlier models can sometimes produce unexpected results on 4.7. Teams that do not catch the drift inside the next two weeks will watch the productivity lift get eaten by quiet instruction regressions nobody flagged in the release notes. The fix is usually small. The miss is expensive.

Batch plus cache adoption rate. The tokenizer expansion nearly disappears for teams running batch mode with prompt caching and stays stubbornly visible for teams that do not. If the honest answer inside the org is “nobody set that up yet,” that is the lowest-effort cost lever available this quarter, and it has nothing to do with the model itself.

What I’d tell you over coffee

If a CTO pinged me this week asking whether to sign off on an Opus 4.7 rollout, the answer is yes, with one line added to the rollout plan. Re-baseline token spend per engineer inside the first week, then again at day thirty. The model is genuinely better at finishing work. It also costs more per request than its rate card implies. Neither number requires panic. Both require looking.

The teams that will get burned are the ones that read “same price as 4.6” and stopped reading.

Sources

The Claude-lash is here: Opus 4.7 is burning through tokens and some people's patience - DNYUZ (Business Insider origin), 2026-04-17
Claude Opus 4.7 Review: What It Really Means for Your Work (2026) - Karo Zieminski Substack, 2026-04-17
Claude Opus 4.7 pricing: $5/1M, new tokenizer explained - Tokencost.app, 2026-04-17
Claude Opus 4.7 is no Mythos, and that's a good thing - Techzine Global, 2026-04-17
Claude Opus 4.7, Here's what works and what doesn't, A PM Perspective - BoringBot Substack, 2026-04-18

Back to all insights