When Amazon mandates review of AI-assisted code, who actually owns the merge?

AI code review is shifting from a courtesy to an enforced gate. Amazon made senior sign-off mandatory, the harnesses are starting to block irreversible actions on their own, and the open question for engineering leaders is who owns the merge.
AI code review just stopped being a polite step and started becoming an enforced gate. On June 19 Claude Code shipped a release that blocks an agent from running destructive git and infrastructure commands the human never asked for. Amazon already made senior sign-off mandatory after AI-assisted changes took down its shopping site. The tooling keeps getting faster and cheaper, but speed was never the missing piece. The missing piece is a named person who owns the merge.
On June 19 the Claude Code changelog carried a line that looked small and meant a lot. In auto mode, the agent will now refuse to run git reset --hard, git checkout -- ., git clean -fd, or git stash drop when nobody asked it to throw away local work. It will refuse git commit --amend on a commit it did not make this session. And it will refuse terraform destroy, pulumi destroy, or cdk destroy unless the specific stack was named out loud (per the v2.1.183 notes via DevelopersIO, June 19).
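A minimal sketch of what a refusal rule of this kind might look like, to make the behavior concrete. This is not Claude Code's implementation: the function name is_refused, the rule table, and the three boolean flags are all hypothetical, and a real harness would need far more careful command parsing.

```python
import shlex

# Hypothetical rule table modeled on the refusals described above.
# Each entry is a prefix of git arguments that discards local work.
DESTRUCTIVE_GIT = [
    ("reset", "--hard"),
    ("checkout", "--", "."),
    ("clean", "-fd"),
    ("stash", "drop"),
]
IAC_TOOLS = {"terraform", "pulumi", "cdk"}


def is_refused(command, discard_requested=False,
               head_made_this_session=False, stack_named_by_user=False):
    """Return True if the agent should refuse to run `command` on its own.

    The three flags are assumptions about context the harness would have
    to track: whether the human asked to discard local work, whether the
    current commit was created in this session, and whether the human
    explicitly named the stack to destroy.
    """
    tokens = shlex.split(command)
    if not tokens:
        return False
    tool, args = tokens[0], tokens[1:]

    if tool == "git":
        # Amending is only allowed on a commit the agent made this session.
        if args[:1] == ["commit"] and "--amend" in args:
            return not head_made_this_session
        if discard_requested:
            return False
        return any(tuple(args[:len(p)]) == p for p in DESTRUCTIVE_GIT)

    if tool in IAC_TOOLS and args[:1] == ["destroy"]:
        # Destroying infrastructure requires an explicitly named stack.
        return not stack_named_by_user

    return False
```

The design choice worth noticing is that the default answer for an irreversible command is refusal, and only explicit human intent flips it. Everything else passes through untouched.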
That change exists because of a real incident. Someone filed a bug, twice, after Claude Code ran git reset --hard on session startup and quietly erased their uncommitted work. The harness now treats a class of irreversible actions as something it is not allowed to do on its own authority.
Read that again, because it is the whole story this week. The tool is starting to enforce the boundary that a reviewer used to enforce by hand. Which raises the question every engineering leader is now holding: if the harness handles the obvious destructive stuff, who owns the part that still requires judgment?
The week the harness started saying no on its own
For two years the model in most orgs has been generous-by-default. The agent could touch the repo, run the commands, open the PR, and a human would catch anything dangerous downstream. That worked when the agent wrote a function. It strains when the agent writes most of the change set and proposes its own merge.
The v2.1.183 release is the first time I have seen a mainstream harness draw a hard internal line: certain actions are irreversible, so the agent does not get to take them unsupervised. This is the same instinct behind Amazon’s policy, just expressed in code instead of process.
And the policy half of that pair is worth sitting with. Earlier this year Amazon told junior and mid-level engineers they must get a senior engineer to sign off before shipping AI-assisted code changes to production. That was not a culture memo. It followed a string of high-blast-radius incidents: a 13-hour outage in late 2025 after its Kiro agent auto-deleted and recreated an AWS Cost Explorer environment, and a roughly six-hour shopping-site outage on March 5 that peaked at 21,716 Downdetector reports, with a subsequent disruption reportedly costing around 6.3 million orders (per coverage aggregated by Awesome Agents on March 10, and reporting in TechRadar and Tom’s Hardware). The same company that set an 80% weekly usage target for its agent also decided the output needed a human gate before it touched production.
So the two halves arrived from opposite directions and met in the middle. The harness vendor is hard-coding refusals. The hyperscaler is hard-coding approvals. Both are saying the same thing: review is no longer optional, and somebody has to own it.
Why AI code passes review and fails in production
Here is the part that surprised me when I dug into the data. The problem is not that AI code reviews badly. It is that AI code reviews beautifully and ships poorly.
New Relic put numbers on it in its 2026 State of AI Coding report, published June 10 (a Hanover Research survey of 200 US technology decision-makers, manager level and above). Ninety-four percent of leaders rate AI-generated code as higher quality than human-authored code at the moment of review. Then it ships. Seventy-eight percent report more production incidents. Eighty-six percent report senior staff spending more time fixing code. Eighty-two percent had at least one production failure tied to AI-generated code in the past six months, and 74% say at least a quarter of their AI code needs significant rework over a year.
"94% of leaders rate AI-generated code as higher quality than human-authored code at the time of review... 78% of respondents report more incidents... 86% report an increase in time senior staff spends fixing code."
New Relic has a name for the gap between those two numbers. They call it “agent debt”: the quiet accumulation of unvetted architectural logic that looks clean in the diff and breaks in production. Their chief technical strategist, Nic Benders, framed the scale of it by noting that 67% of technology leaders say AI now generates or significantly refactors between 51% and 75% of their weekly code output.
That is the trap. Code review evolved to catch the kind of mistakes humans make: a typo, a missed edge case, a sloppy abstraction. AI makes a different kind of mistake. It produces something that reads as correct, passes the reviewer’s eye, satisfies the tests, and then carries a structural assumption nobody verified into a system where it has consequences. The review stage is exactly where this slips through, because the review stage is where it looks best.
AI-generated code passes review and fails in production, which means the value of a reviewer is no longer spotting ugly code. It is verifying logic that was engineered to look right.
Where the tooling helps and where it quietly does not
The AI-powered code review tools got genuinely better this month, and I want to give them full credit before I complicate the picture. Cursor’s Bugbot update cut average review time from about five minutes to roughly 90 seconds, with 90% of runs finishing in under three minutes. It also finds about 10% more bugs at 22% lower cost per run and adds a pre-push /review command (Digital Applied, June 10). GitHub’s Copilot CLI has a /security-review pass. Claude Code dispatches multiple reviewer subagents on a PR. The AI code review tools available to a team in mid-2026 are faster, cheaper, and more thorough than the ones from six months ago.
But look at what improved. Speed improved. Cost improved. Bug recall improved a little. The thing that did not improve is the part that actually protects production: the decision to merge.
A faster AI code reviewer that finds 10% more issues still hands back a list. Someone has to read the list, decide which findings are real, decide whether the change is safe in context, and put their name on the merge. CodeRabbit’s own analysis last December found AI-co-authored pull requests carried roughly 1.7 times more issues than human-only ones. Faster review of code that carries more defects per change is not a smaller problem. It is the same judgment problem, arriving more often and more confidently.
This is where I see teams quietly fool themselves. They adopt a GitHub AI code review bot, watch the PR comments pile up, and feel covered. The bot is not the gate. The bot is a smarter pair of eyes feeding the gate. The gate is still a human decision, and if that decision is not assigned to anyone specific, it is assigned to everyone, which means it is assigned to no one.
The bot is not the gate. The bot is a smarter pair of eyes feeding the gate. And a gate nobody owns is just a door.
Review became a control, not a courtesy
Step back and the shape is clear. Three independent actors, three different mechanisms, one conclusion.
Anthropic encoded it in the harness: irreversible actions require explicit human intent. Amazon encoded it in policy: AI-assisted production changes from less-senior engineers require a senior signature. New Relic measured the cost of not having it: an incident-and-rework tax that lands squarely on senior staff. None of them coordinated. They converged because the underlying reality is the same. When an agent can author most of a change and propose its own merge, the merge decision becomes the most important control surface in the pipeline, and it cannot be left implicit.
The healthy version of this is not heavier process. It is clearer ownership. The teams I see handling it well did three unglamorous things. They named a specific person who owns the merge decision for each service area, not a rotation and not a channel. They moved their metric from review speed to verified merged output, so the number on the slide rewards correctly shipped change rather than fast approvals. And they let the harness enforce the irreversible-action floor, like the v2.1.183 refusals, so the human reviewer spends judgment on logic and context instead of on catching a stray git reset --hard.
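To make the first two of those practices concrete, here is a rough sketch under stated assumptions. The MERGE_OWNERS map, the Change record, and the verified_in_prod flag are all hypothetical. In particular, what counts as "verified" (for example, no rollback or incident attributed to the change after some window) is a definition each team would have to choose for itself.

```python
from dataclasses import dataclass

# Hypothetical ownership map: one named person per service area,
# not a rotation and not a channel. Names are placeholders.
MERGE_OWNERS = {"checkout": "alice", "billing": "bob"}


@dataclass
class Change:
    service_area: str
    approved_by: set                 # identities that approved the merge
    merged: bool = False
    verified_in_prod: bool = False   # assumed team-defined verification flag


def may_merge(change):
    """Allow a merge only when the named owner for the area has approved.

    Bot approvals can sit in approved_by as input, but they never
    satisfy the gate on their own.
    """
    owner = MERGE_OWNERS.get(change.service_area)
    return owner is not None and owner in change.approved_by


def verified_merged_output(changes):
    """Share of merged changes later verified, tracked instead of review speed."""
    merged = [c for c in changes if c.merged]
    if not merged:
        return 0.0
    return sum(c.verified_in_prod for c in merged) / len(merged)
```

A service area with no entry in the map fails the gate outright, which is the point: an unowned area surfaces as a blocked merge rather than as a silent approval by whoever is around.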
That last point matters more than it looks. The guardrails the harness now enforces are not a replacement for the reviewer. They are what frees the reviewer to do the part only a human can: decide whether this change, in this system, on this day, is safe to own.
Before the next planning cycle, name who signs the merge
For an engineering leader, here is the honest read. The “AI code review” conversation has quietly stopped being about which tool finds more bugs. Every serious option finds plenty. It is now about who signs the merge, and whether that person has the time and the standing to say no.
So before the next planning cycle, do one thing. Open the highest-traffic repo and answer a single question out loud: when an AI-assisted change is ready, whose name is on the decision to merge it? If the answer is a tool, the team has a smarter reviewer and an unowned gate. If the answer is “whoever is around,” that is Amazon’s pre-mandate setup, one high-blast-radius change away from learning why they wrote the policy.
The good news is that this is solvable, and it does not require a reorg. It requires a name. Amazon needed a costly outage to learn that. The harness vendors are now building the floor for free. A leader gets to put the name on the gate before anything breaks, which is a far more pleasant way to arrive at the same correct answer.
Sources
- Claude Code v2.1.182 to v2.1.183 Major Updates - DevelopersIO, 2026-06-19
- New Relic Report Reveals AI-Generated Code Grades Higher in Review, Yet Triggers Rise in Production Incidents - New Relic, 2026-06-10
- Amazon Mandates Senior Approval for AI-Assisted Code - Awesome Agents, 2026-03-10
- Cursor Bugbot Reviews in 90 Seconds: The June Update - Digital Applied, 2026-06-10