How to run a 5-day harness governance pilot before June 1

A 5-working-day governance-plane pilot an engineering leader can run before the June 1 GitHub Copilot AI Credits activation, designed to produce the four artifacts a CFO will actually sign.
GitHub Copilot's AI Credits billing activates on June 1, which leaves engineering leaders five working days to convert a coding-agent feature bakeoff into a governance pilot. In the last two weeks, every major harness shipped the primitives the pilot needs to test. The week produces four artifacts a CFO will actually sign: a token-tier policy, an MCP allowlist, a kill-switch runbook, and a config-integrity check.
A CTO I was talking to last Wednesday had a slide in her hand that listed three coding-agent harnesses, a productivity benchmark, and a recommendation to standardize on one. The slide was good. It was also wrong for the conversation that starts on June 2.
GitHub Copilot’s AI Credits billing activates on June 1, and the question her CFO will ask the morning after is not which harness writes the cleanest test fixture. It is whether the harness she picked has an off switch, a budget, and an audit trail that can survive a Tuesday.
That five-working-day gap between now and June 1 is the only chance an engineering leader has to convert a feature bakeoff into a governance pilot before the spend curve starts metering. The good news: in the last two weeks, every major harness shipped the primitives this evaluation needs to test. The bad news: nobody told us they were the actual rubric.
The approach
I keep seeing teams burn a two-week harness pilot on prompts, edit accuracy, and IDE feel. That is the work of January. The work of late May is a different test, run against the admin tenant of one harness, on one team, for one week. Here is the version that produces something the finance team can sign.
- Day 0: Pick one harness and one owner.
One harness. Not three. Not a head-to-head. The team running the pilot will form opinions about the tool itself, and that is fine, and a comparison can come later. For this week the unit being tested is the harness governance layer, not the model. Name one human as the harness owner who will sign every artifact at the end of the week. If no name comes up easily, that is the first thing the pilot already taught us.
- Day 1: Lock the token tier and write the per-engineer monthly ceiling.
GitHub Copilot CLI v1.0.52 made context-window tier selection (the default 200K vs the 1M-token tier) admin-enforced end to end. Claude Code v2.1.149 added /usage with a per-skill, per-subagent, per-plugin, per-MCP-server breakdown. Cursor has soft spend limits that alert at 50, 80, and 100 percent of org or seat caps. Pick one tier policy for the pilot team. Write the monthly per-engineer ceiling in dollars. A tier policy nobody can describe in a sentence is a tier policy nobody will enforce.
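One way to keep the Day 1 policy describable in a sentence is to write it down as data. The sketch below is a minimal illustration, not any vendor's API: `TierPolicy`, `crossed_thresholds`, and `describe` are hypothetical names, and the 50/80/100 percent thresholds simply mirror the soft-limit alerts mentioned above.

```python
from dataclasses import dataclass

# Assumption: thresholds mirror the 50/80/100 percent soft-limit alerts
# described in the text; tune them to whatever the chosen harness supports.
ALERT_THRESHOLDS = (0.50, 0.80, 1.00)


@dataclass(frozen=True)
class TierPolicy:
    """Hypothetical record of the pilot team's token-tier policy."""
    context_tier: str            # e.g. "200K" or "1M"
    monthly_ceiling_usd: float   # per-engineer ceiling, in dollars


def crossed_thresholds(policy: TierPolicy, spend_usd: float) -> list[int]:
    """Return the alert thresholds (whole percents) that spend has reached."""
    if policy.monthly_ceiling_usd <= 0:
        raise ValueError("ceiling must be positive")
    ratio = spend_usd / policy.monthly_ceiling_usd
    return [round(t * 100) for t in ALERT_THRESHOLDS if ratio >= t]


def describe(policy: TierPolicy) -> str:
    """The one-sentence version of the policy that goes into the memo."""
    return (
        f"Pilot team runs the {policy.context_tier} context tier with a "
        f"${policy.monthly_ceiling_usd:,.0f} per-engineer monthly ceiling."
    )
```

If `describe` produces a sentence nobody on the team would say out loud, the policy probably needs another pass before it gets signed.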
- Day 2: Audit the MCP allowlist and lock the harness config file.
The SANS Internet Storm Center diary published on May 25 confirmed something the threat-modeling slides have been hinting at since March: the TeamPCP supply chain worm now plants persistence hooks specifically in ~/.claude/settings.json, which is the harness config file, not the operating system. Open the equivalent file for the chosen harness. Write down which MCP servers are listed, which hooks run, which skills are loaded, which subagents are approved. Decide which of those need to be there. Lock the file with the strictest write permission the platform allows, and add a drift-detection alert that pages somebody when the file changes outside an approved push.
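The drift check itself can be small. The following is a minimal sketch under stated assumptions: the config is a JSON file, MCP servers sit under a top-level `mcpServers` key (an assumption; check where the chosen harness actually stores them), and the approved baseline digest is recorded somewhere the harness cannot write. The function names are hypothetical, and the paging integration is left out.

```python
import hashlib
import json
from pathlib import Path


def file_digest(path: Path) -> str:
    """SHA-256 of the config file's raw bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def unapproved_servers(config: dict, allowlist: set[str]) -> list[str]:
    """Names of configured MCP servers that are not on the allowlist."""
    # Assumption: servers are keyed by name under "mcpServers".
    servers = config.get("mcpServers", {})
    return sorted(name for name in servers if name not in allowlist)


def check_config(path: Path, baseline_digest: str, allowlist: set[str]) -> list[str]:
    """Return human-readable findings; an empty list means no drift found."""
    findings = []
    if file_digest(path) != baseline_digest:
        findings.append("drift: config digest differs from approved baseline")
    config = json.loads(path.read_text(encoding="utf-8"))
    for name in unapproved_servers(config, allowlist):
        findings.append(f"unapproved MCP server: {name}")
    return findings
```

Run on a schedule or from a file-watch hook, a non-empty result is what pages the owner. The baseline digest is refreshed only as part of an approved push.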
- Day 3: Rehearse the kill switch.
GitHub Copilot CLI v1.0.55-3, shipped on May 27, added a small but load-bearing line to the CLI: "Helpful message displays when organization policy disables remote-controlled sessions." The kill switch is now an explicit primitive at the CLI layer, not a workaround. Decide who can throw the switch (the harness owner, the security team, both). Decide how long re-enabling takes. Run the drill: pick a team session, disable it from the admin tenant, watch the message render, then re-enable. The drill takes 20 minutes. The runbook the drill produces is what gets signed.
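The drill is easier to sign when its timings are recorded rather than remembered. Here is a minimal sketch of a drill record, assuming four timestamps noted by hand during the exercise; `DrillRecord`, its field names, and `meets_sla` are hypothetical and do not correspond to anything the CLI emits.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class DrillRecord:
    """Timestamps noted by hand during one kill-switch drill."""
    disabled_at: datetime            # policy switched off in the admin tenant
    blocked_at: datetime             # policy message rendered in the session
    reenable_requested_at: datetime  # someone asked to turn it back on
    restored_at: datetime            # session confirmed working again

    def time_to_block(self) -> timedelta:
        """How long the switch took to bite."""
        return self.blocked_at - self.disabled_at

    def time_to_restore(self) -> timedelta:
        """How long re-enabling took, measured from the request."""
        return self.restored_at - self.reenable_requested_at


def meets_sla(record: DrillRecord, reenable_sla: timedelta) -> bool:
    """True when the measured re-enable time is within the agreed SLA."""
    return record.time_to_restore() <= reenable_sla
```

The re-enable SLA that goes into the runbook should come from a measured `time_to_restore`, not from a guess made before the drill.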
- Day 4: Wire per-agent observability and a sandbox-only network test.
Claude Code put agent_id and parent_agent_id attributes into OTEL spans starting at v2.1.145, so subagent hierarchies show up in any tracing system the team already runs. GitHub Copilot CLI v1.0.55-1 surfaced loaded extensions, their status, and their source through /env. Both shipped this month. Use them. Then take whichever harness session looked the most useful in the first half of the week and run it inside a self-hosted sandbox with network egress restricted to an allowlist of two or three destinations the team actually needs, and read-only access to a sample repository. If the harness cannot complete a real task under that constraint, that is the answer.
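To check that the hierarchy really is recoverable from the trace data, it helps to rebuild it once by hand. The sketch below assumes span attributes have already been exported as plain dictionaries carrying `agent_id` and, for subagents, `parent_agent_id`; the export step, the helper names, and the sample agent names are all assumptions, and the code expects the parent links to be free of cycles.

```python
from collections import defaultdict


def build_agent_tree(spans: list[dict]) -> dict:
    """Map each parent agent id to its sorted child ids (roots sit under None)."""
    children = defaultdict(set)
    for attrs in spans:
        agent = attrs.get("agent_id")
        if agent is None:
            continue  # span carries no agent attribution; skip it
        children[attrs.get("parent_agent_id")].add(agent)
    return {parent: sorted(kids) for parent, kids in children.items()}


def render(tree: dict, parent=None, depth: int = 0) -> list[str]:
    """Indented, depth-first listing of the agent hierarchy."""
    lines = []
    for agent in tree.get(parent, []):
        lines.append("  " * depth + agent)
        lines.extend(render(tree, agent, depth + 1))
    return lines


if __name__ == "__main__":
    sample = [
        {"agent_id": "main"},
        {"agent_id": "reviewer", "parent_agent_id": "main"},
        {"agent_id": "linter", "parent_agent_id": "reviewer"},
    ]
    print("\n".join(render(build_agent_tree(sample))))
```

If a subagent the team never approved shows up in the rendered tree, that is a Day 2 finding surfacing on Day 4.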
- Day 5: Write the four-line memo to the CFO.
Four lines. One: the harness we picked and the named owner. Two: the token-tier policy and the per-engineer ceiling, in dollars. Three: the kill-switch runbook and the re-enable SLA. Four: the config-integrity check and who gets paged on drift. No appendix, no benchmark table, no SWE-bench score. The CFO does not need to know which model wrote the cleanest test fixture. They need to know that when the AI Credits invoice lands the second week of July, four human decisions and one playbook stand between the team and an unbounded line item.
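The memo can be generated from the week's decisions so that no line ships with a blank in it. This is a minimal sketch; `PilotMemo` and its field names are hypothetical, and the wording of each line is only one possible phrasing of the four items listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PilotMemo:
    """The decisions the week produced, one field per blank in the memo."""
    harness: str
    owner: str
    tier_policy: str
    ceiling_usd: int
    runbook: str
    reenable_sla: str
    integrity_check: str
    paged_on_drift: str

    def lines(self) -> list[str]:
        """Render exactly four lines, refusing to emit any empty field."""
        for name, value in vars(self).items():
            if value in ("", None):
                raise ValueError(f"memo field not decided yet: {name}")
        return [
            f"1. Harness: {self.harness}, owned by {self.owner}.",
            f"2. Token tier: {self.tier_policy}; ceiling ${self.ceiling_usd:,} per engineer per month.",
            f"3. Kill switch: {self.runbook}; re-enable SLA {self.reenable_sla}.",
            f"4. Config integrity: {self.integrity_check}; drift pages {self.paged_on_drift}.",
        ]
```

A `ValueError` on Friday morning is a cheap way to find out which decision the week never actually made.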
Why most teams get this wrong
Most coding-agent harness evaluations I look at right now are run as if the harness is a development tool. It is not. By the second week of June every harness on the procurement shortlist will be metering tokens at the admin tenant, with admin-controlled budget pools and shadow finance dashboards rolling up usage by team. The harness is a metered utility wearing the costume of a software seat.
The teams that get this wrong do one of two things. They run a feature bakeoff and pick a winner on completion quality, and then three months later find out that the same harness has unbounded MCP load, a config file nobody locked down, and a session-budget alert routed to a Slack channel that has been muted since December. Or they postpone the evaluation, wait for the dust to settle, and turn up to the July budget review with a per-engineer line item that came in at four times what the budget model forecast.
Mario Rodriguez, GitHub’s CPO, put the rubric on the company blog on May 22, verbatim: “The bottleneck has shifted to shipping software: reviewing it, securing it, governing it, and deploying it.” Every Gartner Magic Quadrant Leader published their version of the same line in the same 72 hours. When every vendor at the top of the quadrant is pitching governance instead of speed, the evaluation that still scores on speed is testing the wrong axis.
"Individual engineers were spending between $500 and $2,000 a month on tokens. Around 70 per cent of code committed at Uber now originates with AI."
The numbers
The thresholds the pilot needs to anchor against are public and unambiguous. GitHub Copilot Business stays at $19 per user per month with $19 in monthly AI Credits; Enterprise stays at $39 per user per month with $39 in monthly AI Credits, base seat pricing unchanged. One AI Credit equals one US cent, so a $10 monthly budget covers a thousand credits. Promotional included usage runs $30 per Business user and $70 per Enterprise user for June, July, and August, which is the only soft on-ramp.
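The credit arithmetic is worth writing down once so the ceiling and the invoice use the same units. The sketch below takes the one-credit-per-cent rate from the figures above; the function names are hypothetical, and the assumption that spend beyond the included amount is billed dollar for dollar is mine, so verify it against the billing documentation before using it to size a budget.

```python
CENTS_PER_CREDIT = 1  # from the figures above: one AI Credit equals one US cent


def usd_to_credits(usd: float) -> int:
    """Convert a dollar amount into AI Credits."""
    return round(usd * 100 / CENTS_PER_CREDIT)


def overage_usd(ceiling_usd: float, included_usd: float) -> float:
    """Dollars of a ceiling that fall outside the included monthly credits.

    Assumption: usage beyond the included amount is billed dollar for dollar.
    """
    return max(0.0, ceiling_usd - included_usd)


if __name__ == "__main__":
    print(usd_to_credits(10))   # the $10 budget from the text, in credits
    print(overage_usd(500, 19)) # a $500 ceiling on a Business seat
```

For a Business seat with $19 included, a $500 ceiling leaves $481 of exposure per engineer per month under that assumption, which is the number finance will multiply by headcount.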
The benchmark to size against is not the Anthropic API rate or the OpenAI token price. It is the operating range TheNextWeb published on May 25: individual engineers at Uber spending between $500 and $2,000 a month on tokens, 70 percent of code at Uber AI-originated, Claude Code use inside Uber CTO Praveen Neppalli Naga’s roughly 5,000-engineer org climbing from 32 percent in January to 84 percent in March. A pilot that sets the per-engineer ceiling at $50 a month is not testing the same harness the active users at Uber are running. A ceiling at $2,000 is. Pick a number an internal heavy user can actually hit and still ship.
Ship it
The pilot does not need to be perfect. It needs to be done by Friday. Pick one harness. Name one owner. Set one tier. Lock one config. Rehearse one drill. Write four lines. Most of the work this week is deciding what not to test, so a CFO can read the memo in 30 seconds and an engineer can show up Monday morning knowing what the rules are. If the harness chosen for the pilot turns out to be the wrong long-term bet, the artifacts still port: the token tier, the MCP allowlist pattern, the kill-switch runbook, and the config-integrity check are all harness-portable by design. The work is durable even when the vendor is not.
Sources
- GitHub Copilot CLI v1.0.55-3 pre-release (hook progress streaming, pluginDirectories on session.create / session.resume, organization policy gate for remote-controlled sessions, plugin precedence, reasoning token counts) - GitHub Copilot CLI releases, 2026-05-27
- GitHub Copilot CLI v1.0.55-1 pre-release (/env now shows loaded extensions with status and source) - GitHub Copilot CLI releases, 2026-05-26
- GitHub Copilot CLI changelog (v1.0.52 context window tier selection enforced end-to-end at the admin tenant) - GitHub Copilot CLI changelog, 2026-05-23
- Claude Code release notes v2.1.149-150 (allowAllClaudeAiMcps managed setting, /usage per-skill/subagent/plugin/MCP breakdown, agent_id and parent_agent_id OTEL spans from v2.1.145) - Releasebot / Anthropic, 2026-05-23
- TeamPCP Supply Chain Campaign: Activity Through 2026-05-24 (Mini Shai-Hulud burst, persistence hooks targeting ~/.claude/settings.json) - SANS Internet Storm Center (Diary 33016), 2026-05-25
- Microsoft's quiet Claude Code retreat and the real cost of enterprise AI - TheNextWeb, 2026-05-25
- GitHub Copilot is moving to usage-based billing on June 1 2026 - The GitHub Blog, 2026-04-27
- GitHub recognized as a Leader in the Gartner Magic Quadrant for Enterprise AI Coding Agents for the third year in a row - The GitHub Blog, 2026-05-22