Who's Checking the AI Coding Productivity Number?

Who's Checking the AI Coding Productivity Number?

Two stories from this week, NVIDIA shipping GPT-5.5 Codex to 10,000 employees and Anthropic admitting Claude Code silently regressed for over a month, point to the same executive question. The productivity number is only as real as the person checking it.

TLDR

Two stories crossed the wire this week. NVIDIA put GPT-5.5-powered Codex in front of 10,000 employees on day one. Anthropic admitted Claude Code had quietly degraded for over a month before an outside engineer at AMD audited the session logs and surfaced it. The executive lesson is the same in both: the productivity number is only as real as the person inside the company who is checking it.

Two stories crossed my feed this week, and they are really the same story.

NVIDIA put GPT-5.5-powered Codex in front of more than 10,000 employees on day one of the model’s release. Engineering, product, legal, finance, marketing, HR, sales, operations. The internal quotes are exactly what gets quoted from a workforce that just got handed a faster engine. “Mind-blowing.” “Life-changing.” “Debugging cycles that once stretched across days are closing in hours.” That is TechRadar’s read on April 24, citing the NVIDIA deployment numbers and the GB200 NVL72 economics that make it work.

Forty-eight hours earlier, Anthropic published a postmortem admitting that Claude Code had been quietly degraded for more than a month. Three separate product changes, shipped between March 4 and April 16, layered on top of each other to silently lower output quality across paying users. The postmortem only landed after an outside engineer at AMD ran the audit nobody inside Anthropic had been running.

Same story. The productivity number is only as real as the person checking it.


The two scoreboards on the same desk

Imagine the CEO of a 1,200-person engineering org whose CTO walks in with both articles printed out. What is the right move?

Most exec teams handle the NVIDIA story easily. The numbers are clean. NVIDIA is the platform, OpenAI is the workload, Codex is the surface. “Over 10,000 NVIDIANs - across engineering, product, legal, marketing, finance, sales, HR, operations and developer programs - are already using GPT-5.5-powered Codex.” Straight from the company. The 35x and 50x infrastructure numbers come from NVIDIA’s own platform team. The “debugging cycles closing in hours” line is repeated across TechRadar and TweakTown coverage on April 24, both of which trace it back to the same NVIDIA disclosure. Reasonable people can debate whether the deployment generalizes outside NVIDIA’s own stack. Nobody can debate that NVIDIA shipped this in a way that someone, somewhere, will eventually be able to grade.

The Anthropic story is the other half of the same conversation, and it does not get printed in board decks. From March 4 onward, Claude Code’s default reasoning effort dropped from high to medium. Three weeks later, a caching change meant to speed up resumed sessions started clearing the model’s working memory every single turn. Then on April 16, a new system prompt told the model to be brief, and a single line of that prompt cost about 3% of coding-eval performance on both Opus 4.6 and 4.7. None of these were caught by Anthropic’s internal evaluations. They were caught by an AMD senior director named Stella Laurenzo who pulled 6,852 session files and 234,760 tool calls and graphed them.

"Files read before editing had dropped from 6.6 to 2.0. Median visible thinking length had collapsed from roughly 2,200 characters to about 600."

Implicator.ai, citing AMD's Stella Laurenzo audit, April 2026

That is the audit that should have come from the vendor. It came from a customer.


What both teams actually measured

Here is what I find genuinely interesting about these two stories sitting next to each other.

NVIDIA, telling a positive productivity story, measured deployment scale and unit economics. They reported how many people had access, what the inference cost looked like on Blackwell, and they let employees self-report time saved. That is fine for a launch announcement. It is not enough for a P&L.

Anthropic, with the negative productivity story, also measured the wrong thing. Their internal evaluations did not catch the regression. The signal that finally surfaced the problem was a custom audit by a customer, looking at metrics the vendor was not even publishing: average files-read-before-edit, median thinking length, tool-call counts. BuildFastWithAI’s GPT-5.5 review on April 24 made the same point from the opposite direction, noting GPT-5.5 “uses approximately 40% fewer output tokens to complete the same Codex task as GPT-5.4.” Exactly the kind of number a vendor is happy to publish. Also exactly the kind of number that does not survive a real-world session audit unless somebody is comparing pre and post.

6,852
session files audited by AMD's Stella Laurenzo to surface a Claude Code regression that Anthropic's internal evals missed for 38 days

Both teams told a productivity story. Neither team measured it the way an outside auditor would.


Where the gap actually lives

If I were on the call, the question I would push on is not “are coding agents working.” The data says they probably are, on average, for most teams using them seriously. The question is closer to this: who inside the company is responsible for noticing when the number quietly moves the wrong way?

In most companies I see right now, the answer is nobody. The harness gets procured by IT or DevEx. The seats roll out through engineering. The bill goes to finance. The metrics, when they exist, sit inside the vendor dashboard, which is a polite way of saying the vendor grades its own homework. AMD found the Claude Code regression because Stella Laurenzo had built her own session-replay pipeline and ran it monthly. Nobody asked her to. Nobody at Anthropic could have. That is the single most important detail in the postmortem.

The AMD audit was not exotic. Files read before editing, thinking length, tool-call distribution. These are not novel metrics. They are just metrics that need a human, inside the company, who has agreed to track them on a cadence and raise a flag when they drift.


The pattern

Here is the pattern I keep seeing across 2026 deployments, and these two April 24 stories make it impossible to ignore.

Vendor-reported productivity is a marketing artifact. Not wrong, exactly. Incomplete. The numbers a harness vendor hands over are the numbers that look good in a launch post. Token efficiency. Active users. Sessions per week. Time-to-first-edit. Real numbers. Also chosen ones.

Customer-side productivity is a finance artifact. It is the number that survives an audit. Cycle time on AI-assisted PRs, defect rate on AI-touched code, incident rate post-deploy, cost per merged change. Most companies do not have these numbers because nobody owns them.

The gap between those two scoreboards is where regressions hide for thirty-eight days while paying customers complain in Reddit threads and the vendor’s own evals miss them. The gap is also where over-reporting happens, because the easiest way to look like a company has a productivity story is to point at vendor numbers nobody is grading.

Key Insight

The companies that will compound advantage from coding agents this year are not the ones picking the right harness. They are the ones who know who in the org is reading the session logs.


What I would tell you over coffee

Three small moves before the next board update.

One. Pick a single weekly metric engineering owns, not the vendor. Cycle time on AI-touched PRs is the easiest. Merged-change cost, if the finance partner is sharp. Pick one. Track it for eight weeks. The point is not the number. The point is having a number nobody outside the company can change without telling you.

Two. Name a person whose job it is to notice. It does not need a new headcount. It needs a sentence in someone’s existing job description. “Owns weekly review of AI-coding session metrics, flags drift to engineering leadership.” That is the whole deliverable. AMD did not have a coding-agent observability team. They had a senior director with a script.

Three. Run the AMD thought experiment in the next staff meeting. If the coding agent quietly got 5% worse for a month, the way Claude Code just did, what happens next? If the answer is “we would notice,” ask how. If the answer is “the vendor would tell us,” that is the work item.

The good news in both of these stories is that the productivity gain is real. NVIDIA is not making up its 10,000-employee rollout. Anthropic is not making up the speed gains that drove $2.5 billion in Claude Code revenue in a year. The productivity is there. It is just under-audited. And under-audited productivity is how a clean launch becomes a quiet drift becomes a board question nobody can answer.

The fix is not exotic. It is just deciding that the audit is the job.

Sources

  1. GPT-5.5 Review: Benchmarks, Pricing & Vs Claude (2026) - BuildFastWithAI, 2026-04-24
  2. OpenAI deploys GPT-5.5 Codex across NVIDIA Blackwell systems: 50x efficiency boost and 35x cost reduction makes AI viable at enterprise scale - TechRadar Pro, 2026-04-24
  3. NVIDIA deploys GPT-5.5-powered Codex to 10,000 employees, with engineers calling results 'mind-blowing' - TweakTown, 2026-04-24
  4. Anthropic Explains the Claude Code Quality Drop: Here Is What Actually Happened - DevToolPicks, 2026-04-24
  5. Anthropic Pins Claude Code Quality Drop on Three Changes - Implicator.ai, 2026-04-24

Back to all insights