The self-hosted LLM to run is the one the VRAM tier already fits

2026-06-29

A three-rung ladder diagram labeled with GPU VRAM tiers, 24 to 32 gigabytes at the bottom, 80 to 96 gigabytes in the middle, multi-GPU cluster at the top, each rung holding a differently sized open-weight model block.

A fresh hardware-tier matching matrix lines up 2026 open-weight coders against the GPU a team actually owns. The model to self-host is set by the VRAM tier on hand and a real task eval, not the leaderboard top.

TLDR

A hardware-tier matching matrix published this week lines up 2026 open-weight coders against the GPU that actually runs them. On a single 96GB card, an 80B sparse model fits at ~46GB and verifies 71.3% on SWE-bench Verified; the biggest open release, GLM-5.2 at 744B, needs four H200s minimum. The model to self-host is decided by the VRAM tier on hand and a 30-example eval on the real task, not by who tops the leaderboard.

I read a hardware-tier matching matrix this week that did something most local-model writeups skip. Instead of ranking open-weight coders by benchmark, it ranked them by the GPU tier that can actually run each one. Digital Applied published it on June 29, and it is the cleanest answer I have seen to the question every engineering leader actually asks before they self-host anything: not “what is the best open model,” but “what is the best open model I can fit on the box I already have.”

That is a different question, and the gap between the two is where most self-hosting plans quietly fall apart.

The pick is set by the VRAM tier, not the leaderboard

Here is the problem this solves. A team decides to self-host, opens a leaderboard, sees the top open-weight model, downloads it, and discovers it needs more GPU than they own by an order of magnitude. The weights are free. The memory is not, and the memory is the binding constraint.

So start from the hardware, not the leaderboard. The matrix sorts cleanly into three tiers.

On a consumer or single workstation card up to about 32GB, the comfortable pick is Devstral Small 2, a 24B dense model that runs at roughly 12GB at 4-bit and verifies 68.0% on SWE-bench Verified at 25 to 35 tokens per second. Qwen3-Coder-30B-A3B is the other option in that tier, fitting around 15GB at about 64%. Either one is a real coding model, not a toy, and both leave headroom on a 24GB card.

On a single 96GB workstation card, the RTX PRO 6000 Blackwell class, the picture opens up. Qwen3-Coder-Next, an 80B mixture-of-experts model that activates only 3B per token with linear attention, sits at about 46GB in Q4_K_M and leaves roughly 50GB for KV cache. It verifies 71.3% on SWE-bench Verified at 40 to 59 tokens per second. Devstral 2, a 123B dense model, also fits this tier but tightly, around 62GB at 4-bit, 72.2% verified, at a slower 20 to 29 tokens per second. As Digital Applied put it:

"An RTX PRO 6000 Blackwell (96GB) runs Qwen3-Coder-Next (80B MoE) comfortably or Devstral 2 (123B dense) tightly at 4-bit."

Digital Applied, June 2026

Then there is the top tier, and this is the honest part. GLM-5.2, the 744B/40B mixture-of-experts model that everyone talked about when it shipped, lands at about 372GB even at INT4 and needs a minimum of four H200s. It reports 62.1% on SWE-bench Pro. It is genuinely open weight, and it is genuinely not a single-box decision. Most teams will never make that purchase, and they do not need to.

Why downloading the biggest open model is the wrong first move

Most teams get this wrong in the same way, and it is an understandable mistake. The leaderboard is the most visible artifact in the open-weight world, so it becomes the default shopping list. But the model at the top of the open leaderboard is almost never the model a normal box can serve on the hardware on hand.

There is a second, quieter version of the mistake. Even the best self-hostable open-weight coder tops out in the low 70s on SWE-bench Verified, while the closed frontier sits in the high 80s to mid 90s. As Digital Applied stated flatly, self-hostable open-weight models “top out around 71-72% SWE-bench Verified.” So if the plan was to self-host because the open models are now as good as the closed ones, the plan needs an honest edit. The best open model available to download is no longer a toy. It is also not the best coder available. Those are both true, and a good selection process holds both at once.

The third thing teams miss is that fitting in VRAM is necessary but not sufficient. Memory bandwidth, not capacity, sets token speed. A 123B dense model that fits in 62GB still runs at 20 to 29 tokens per second, which feels very different in an interactive coding loop than the 80B sparse model at 40 to 59. And the card that runs all of this, the RTX PRO 6000 Blackwell, launched at roughly $8,565 MSRP but has high and volatile street pricing right now thanks to the 2026 GDDR7 and DRAM shortage. “It fits” is the start of the decision, not the end of it.

Key Insight

Capacity decides whether a model loads. Bandwidth decides whether it is usable. A model that fits in VRAM but runs at 20 tokens per second on a card nobody can reliably buy at MSRP is a different decision than the spec sheet suggests.

The VRAM math that ladders the whole decision

The matrix is handy, but the sizing does not depend on it. The background guides converge on a simple rule of thumb: a dense model at Q4 needs about 0.6GB per billion parameters, plus a few gigabytes for context. That single number ladders the entire decision.

Run it down the tiers. An 8 to 12GB card handles an 8B-class model around 5GB. A 16GB card handles a 12B at about 7.5GB, or a 24B like Mistral Small 3.1 at roughly 13GB and 55 tokens per second. A 24 to 32GB card handles a 27B around 16GB. A 48GB card handles a 70B at Q4_K_M, around 40GB. A 128GB unified-memory box handles a roughly 200B sparse MoE in the 110 to 120GB range. Those tier numbers are background, dated June 9 or evergreen, but they are stable and they match the fresh matrix in shape.

Open-weight coder by GPU tier (SWE-bench Verified, source: Digital Applied, Jun 2026)

GPU tier	Model	VRAM at 4-bit	Score
≤32GB	Devstral Small 2 (24B)	~12GB	68.0%
96GB	Qwen3-Coder-Next (80B MoE)	~46GB	71.3%
96GB	Devstral 2 (123B dense)	~62GB	72.2%
4x H200	GLM-5.2 (744B MoE, INT4)	~372GB	62.1%*

One more honest note on timing. Nothing new shipped this week. The llm-stats updates tracker said plainly there were no open-source releases, with GLM-5.2 from June 16 still the freshest open-weight model and only a proprietary, API-only release more recent. That is not a gap in the news. It is the finding. Self-hosting selection is a stable, hardware-driven fit decision right now, not a chase-the-release scramble.

The discipline that decides a good self-hosted deployment is not release-watching. It is a standing 30-to-50-example eval, at the deploy quant, on the hardware actually on hand.

What I’d do before the next hardware PO

Match the model to the tier already on hand, then prove it on the real task before committing. On a single 24 to 32GB card, start with Devstral Small 2 or Qwen3-Coder-30B and expect a real high-60s coder. On a 96GB workstation card, Qwen3-Coder-Next is the comfortable default and Devstral 2 is the tighter, slightly stronger, slightly slower alternative. If GLM-5.2 was the target, price the four-H200 cluster honestly first, because that is a different conversation than buying one card.

Then run 30 to 50 real examples from the actual codebase against the model at the exact quant slated to deploy, on the exact hardware. The leaderboard score was measured at full precision on someone else’s tasks. That local eval is the only number that says whether the model a team can afford to run is good enough for the work it actually does. Pick for fit, prove with eval, and let the leaderboard stay what it is: interesting, not decisive.

Sources

Best Open-Weight Coding Models to Self-Host in 2026: a hardware-tier matching matrix - Digital Applied, 2026-06-29
Best Local AI Models by VRAM: 8GB to 384GB (June 2026) - Modem Guides, 2026-06-09
GLM-5.2 self-hosting guide (vLLM): VRAM-by-precision floor and multi-GPU requirements - Lushbinary, 2026-06-16

Back to all insights