How to Run an Open-Weight LLM in Production With vLLM

2026-06-25

A rack of datacenter GPUs with a memory-usage overlay showing model weights filling most of the VRAM and a thin remaining slice labeled KV cache.

Serving an open-weight model in-house is a sizing and operations job, not a download. A five-step playbook for budgeting VRAM, pinning a vLLM version, and running the break-even before you buy a single GPU.

TLDR

Running an open-weight model in-house is a sizing-and-operations job, not a download. Budget VRAM as weights plus KV cache plus runtime overhead, pin a known-good vLLM version before trusting the FP8 path, pick a quant on purpose, and run the break-even first. Below a few thousand prompts a day the hosted plan usually still wins, and the DevOps tax is the line item teams forget.

This week two updated self-hosting guides walked through the same exercise, and both landed on a number that surprises people the first time they do the math: the model weights are the easy part of the bill. A Lushbinary guide updated June 25 sized GLM-5.2, the MIT-licensed 744-billion-parameter model that shipped open weights this month, onto private GPUs and put the practical floor at an eight-card H200 node. Not because the weights are huge, though they are. Because once the KV cache and the runtime overhead come in, a node that looks roomy on paper gets tight fast.

That is the whole story of self-hosting in one sentence. The download is free. The operations are not.

So this is a playbook for the part the model card skips: how to take an open-weight model and actually serve it in production with vLLM without it turning into a second on-call rotation. I will use GLM-5.2 as the worked example because it is fresh and well-documented, but the method is the same whether the target is a 744B coding model or a 27B support bot.

The number that ends most “we’ll just self-host it” plans

Start with the sizing, because the sizing decides everything downstream and it is where the optimism dies.

A model’s weight footprint is simple arithmetic: parameters times bytes per parameter. At FP8, GLM-5.2’s roughly 744 billion parameters need about 744GB. An eight-card H200 node gives 1,128GB of aggregate VRAM. Looks comfortable. It is not, and the Lushbinary guide is explicit about why.
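That arithmetic is short enough to keep as a script. A minimal sketch using only the figures quoted above (744 billion parameters, one byte per parameter at FP8, 141 GB per H200); the function names are mine, not from any guide:

```python
H200_VRAM_GB = 141  # per-card VRAM, as quoted in the Lushbinary guide


def weights_gb(params_billions: float, bytes_per_param: float) -> float:
    """Weight footprint in GB: parameters times bytes per parameter."""
    # billions of parameters * bytes each = billions of bytes = GB
    return params_billions * bytes_per_param


def node_vram_gb(cards: int, per_card_gb: float = H200_VRAM_GB) -> float:
    """Aggregate VRAM across all cards in one node."""
    return cards * per_card_gb


fp8_weights = weights_gb(744, 1.0)   # FP8 stores one byte per parameter
node = node_vram_gb(8)               # eight-card H200 node
headroom = node - fp8_weights        # what is left before cache and overhead

print(f"weights {fp8_weights:.0f} GB, node {node:.0f} GB, headroom {headroom:.0f} GB")
```

The headroom figure is the one to be suspicious of, because nothing has been subtracted from it yet.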

"An 8x H200 node provides about 1,128 GB of aggregate VRAM (8 x 141 GB), which leaves meaningful headroom over the weights for KV cache and the 10 to 20 percent runtime overhead."

Lushbinary, June 2026

Read that again. The 384GB that looked like spare capacity is not spare. It is the working space for the KV cache, the per-request memory that grows with context length and concurrency, plus a runtime overhead that the same guide puts at 10 to 20 percent. The KV cache alone can run into many tens of gigabytes per concurrent request even with grouped-query attention and an FP8 cache. Go to full BF16 precision and the weights roughly double to 1,488GB, which points at sixteen GPUs instead of eight.
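To see why the cache gets that large, here is the usual back-of-envelope formula for a grouped-query-attention model: two tensors (keys and values) per layer, per KV head, per token. This is a rough sketch, and the layer count, KV-head count, head dimension, and context length below are placeholders I picked for illustration. They are not GLM-5.2's published architecture, and the model's real attention layout may differ.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_elem: float) -> float:
    """KV-cache bytes for one token: keys plus values, across all layers."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


def kv_gb_per_request(context_tokens: int, layers: int, kv_heads: int,
                      head_dim: int, bytes_per_elem: float) -> float:
    """KV-cache size in GB for one request that fills context_tokens."""
    per_token = kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_elem)
    return per_token * context_tokens / 1e9


# HYPOTHETICAL architecture and context length, for illustration only.
ARCH = dict(layers=90, kv_heads=8, head_dim=128)
CONTEXT_TOKENS = 200_000

fp8_cache = kv_gb_per_request(CONTEXT_TOKENS, bytes_per_elem=1.0, **ARCH)
bf16_cache = kv_gb_per_request(CONTEXT_TOKENS, bytes_per_elem=2.0, **ARCH)

print(f"FP8 cache per request:  {fp8_cache:.1f} GB")
print(f"BF16 cache per request: {bf16_cache:.1f} GB")
```

Whatever the real architecture numbers turn out to be, the shape holds: the cost is linear in context length and in bytes per element, and it is paid once per concurrent request.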

So the binding constraint is rarely the weights. It is the weights plus everyone talking to the model at once, plus the context window they were promised. That is the trap. A team sizes the box for the weights, ships it, and watches it fall over at moderate concurrency because nobody budgeted the cache.

Key Insight

VRAM is weights plus KV cache plus 10 to 20 percent overhead. Sizing for weights alone is the single most common reason a self-hosted deployment that "fit" in the spreadsheet falls over in production.
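Put as a check you can run before provisioning, the budget looks something like the sketch below. Two assumptions are mine: the 10 to 20 percent overhead is treated as a fraction of total VRAM, and the 40 GB per-request cache figure is a placeholder standing in for "tens of GB" at long context.

```python
import math


def max_concurrent_requests(total_vram_gb: float, weights_gb: float,
                            kv_gb_per_request: float,
                            overhead_fraction: float) -> int:
    """How many full-context requests fit after weights and overhead."""
    usable_for_cache = total_vram_gb * (1 - overhead_fraction) - weights_gb
    if usable_for_cache <= 0:
        return 0  # the weights do not even fit once overhead is reserved
    return math.floor(usable_for_cache / kv_gb_per_request)


NODE_GB = 8 * 141        # eight H200 cards
FP8_WEIGHTS_GB = 744
KV_GB = 40               # HYPOTHETICAL per-request cache at long context

for overhead in (0.10, 0.20):
    fits = max_concurrent_requests(NODE_GB, FP8_WEIGHTS_GB, KV_GB, overhead)
    print(f"overhead {overhead:.0%}: {fits} concurrent long-context requests")
```

Under those assumptions the node that "fit" handles a single-digit number of simultaneous long-context requests, which is the failure the spreadsheet hides.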

The five steps from open weights to a serving endpoint

Here is the sequence I would run, in order, before a single GPU gets bought. None of it is exotic. The value is in doing it in this order, because each step kills a plan that the next step would have wasted money on.

1. Budget VRAM as weights plus cache plus overhead
Compute weights from parameters times bytes per parameter at the chosen precision. Then add KV cache (tens of GB per concurrent request at long context) and a 10 to 20 percent runtime overhead. For GLM-5.2 at FP8 that math lands on an eight-card H200 node as the practical floor, with real headroom only after the cache is accounted for.
2. Pin a known-good vLLM version before trusting the FP8 path
Serving-engine version is not a detail. The ofox.ai guide is specific: vLLM v0.23.0 is the minimum that serves the FP8 GLM-5.2 path on a general-availability release. Patch releases after that add throughput, but the floor is the floor. Pin the version in the container and test the exact model-and-quant pair on it before promising anyone an endpoint.
3. Pick a quant on purpose, not by default
Quantization is a lever, not a freebie. AWQ INT4 cuts GLM-5.2 from about 744GB to about 372GB, which moves it from eight H200s to four, at a stated 1 to 3 percent quality regression on coding benchmarks. That is a real tradeoff to make deliberately: halve the box, accept a small measured quality cost, then verify it on the one task that matters rather than trusting the benchmark.
4. Set the cache and concurrency flags, then load-test them
The serving config is where sizing meets reality. An FP8 KV cache roughly halves per-token cache memory versus BF16, which is often what makes the required concurrency fit at all. Set the memory-utilization ceiling with headroom for allocation spikes, cap the context length to what the workload truly uses, and then push real traffic at it. The cache saturating under bursty load is a far more common failure than the weights not fitting.
5. Run the break-even before you buy anything
This is the step that should happen first emotionally and last procedurally, because it can cancel the whole project. Price the real prompt volume against a hosted baseline. If the math says hosted wins, the cleanest deployment is the one nobody runs.
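The version floor from the pinning step is easy to enforce at container start-up instead of trusting the image tag. A minimal sketch, assuming the v0.23.0 floor quoted from the ofox.ai guide; the helper names are hypothetical, and the parser is deliberately simple (it reads the leading X.Y.Z and ignores suffixes, so a pre-release such as 0.23.0rc1 would pass).

```python
import re
from importlib import metadata

MIN_VLLM = (0, 23, 0)  # floor quoted in the ofox.ai guide for the FP8 path


def parse_version(text: str) -> tuple:
    """Return the leading X.Y.Z of a version string as a tuple of ints."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"unrecognised version string: {text!r}")
    return tuple(int(part) for part in match.groups())


def meets_floor(installed: str, floor: tuple = MIN_VLLM) -> bool:
    """Compare numerically, so 0.9.2 is correctly older than 0.23.0."""
    return parse_version(installed) >= floor


def check_installed_vllm() -> str:
    """Exit early if vLLM is missing or older than the pinned floor."""
    try:
        installed = metadata.version("vllm")
    except metadata.PackageNotFoundError:
        raise SystemExit("vllm is not installed in this environment")
    if not meets_floor(installed):
        raise SystemExit(f"vllm {installed} is below the required floor {MIN_VLLM}")
    return installed


# Call check_installed_vllm() in the container entrypoint before serving.
```

Comparing tuples rather than strings matters here: as text, "0.9.2" sorts after "0.23.0", which is exactly the wrong answer.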

That last step deserves its own section, because it is where the most money is saved or wasted.

Why the GPU bill almost never breaks even at the volume you have

The reason most self-hosting plans should stop at step five is that the economics are unforgiving below a surprisingly high volume line. The ofox.ai guide, updated June 23, puts hard numbers on it.

"you need ~3,000+ prompts/day ... for cloud to beat $30/month hosted. That's a 20-developer team running coding agents constantly."

ofox.ai, June 2026

An eight-card H200 cloud node runs roughly 30 to 50 dollars an hour blended. A hosted plan for the same model can be on the order of 30 dollars a month. The crossover, where standing up a private serving stack beats just paying for the hosted endpoint, sits around 3,000 prompts a day, which the guide frames as a twenty-developer team running coding agents constantly. Below about 100 prompts a day with no compliance constraint, its advice is blunt: do not self-host.

3,000+

prompts per day before self-hosting on cloud GPUs beats a hosted plan for a frontier open-weight coding model (ofox.ai, June 2026)

And that number is only the GPU side. The part that quietly wrecks the comparison is the operations cost on top. A Digital Applied decision guide from late May put the labor multiplier plainly: DevOps salaries, model update cycles, and infrastructure overhead typically add a three-to-five-times multiplier on top of the GPU rental alone. The GPU is the cheap part of running the GPU.
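Here is the break-even as a generic calculation, so the labor multiplier has somewhere to go. This is a sketch under assumptions of my own: the node runs around the clock, the multiplier is applied to the GPU rental to give a total, and the hosted side is modeled as a per-prompt price. The per-prompt figure is a placeholder, and the output is not meant to reproduce the ofox.ai number, which is measured against a flat monthly plan.

```python
HOURS_PER_MONTH = 730  # average month, assuming the node runs 24/7


def self_host_monthly_usd(gpu_hourly_usd: float,
                          ops_multiplier: float = 1.0) -> float:
    """Monthly cost of a rented node, scaled by the operations multiplier."""
    return gpu_hourly_usd * HOURS_PER_MONTH * ops_multiplier


def break_even_prompts_per_day(monthly_usd: float,
                               hosted_usd_per_prompt: float,
                               days_per_month: int = 30) -> float:
    """Daily prompt volume at which hosted spend equals self-host spend."""
    if hosted_usd_per_prompt <= 0:
        raise ValueError("hosted_usd_per_prompt must be positive")
    return monthly_usd / (hosted_usd_per_prompt * days_per_month)


GPU_HOURLY_USD = 40            # midpoint of the 30 to 50 dollar range above
HOSTED_USD_PER_PROMPT = 0.10   # HYPOTHETICAL hosted price, illustration only

for multiplier in (1, 3, 5):
    monthly = self_host_monthly_usd(GPU_HOURLY_USD, multiplier)
    volume = break_even_prompts_per_day(monthly, HOSTED_USD_PER_PROMPT)
    print(f"ops x{multiplier}: ${monthly:,.0f}/month, "
          f"break-even near {volume:,.0f} prompts/day")
```

The useful property is the linearity: whatever the GPU-only crossover is, a three-to-five-times operations multiplier moves it out by the same factor.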

I have watched this play out the expensive way. A team moves off the API to “save money,” provisions a node, runs it at single-digit utilization because their traffic is bursty, and somehow spends more than before while also now owning a pager. The fix was never a better model. It was batching the traffic, capping the context window to what the workload actually used, and admitting that a chunk of their volume was fine staying hosted. Local won where it should and the API kept the rest.

The GPU is the cheap part of running the GPU. The forgotten line item is the engineer who reboots it at 2 a.m.

What the serving guides agree on, and where to be careful

Step back and the picture across this week’s guides and the older operations writeups is consistent, which is reassuring because it means the method is stable even as models churn.

The sizing logic (weights plus cache plus overhead) is the same in the June 25 Lushbinary guide and the June 17 Spheron deployment writeup, which lands on the same eight-card H200 floor for GLM-5.2 and adds the concrete serving flags: tensor parallelism across the eight cards and an FP8 KV-cache dtype for the long-context path. The break-even direction (self-host wins only at sustained high volume) is the same in the ofox.ai prompts-per-day framing and the older tokens-per-month analyses, even though they use different units. When independent guides built for different audiences agree on the shape of the answer, that is the part to trust.
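Those serving flags translate into a launch command along these lines. A hedged sketch: the flag names are vLLM's CLI options as I know them, so confirm them against `vllm serve --help` on the version you pinned. The model identifier, the 0.90 memory ceiling, and the context cap are placeholders, not values from any of the guides.

```python
import shlex


def vllm_serve_argv(model: str,
                    tensor_parallel_size: int = 8,
                    kv_cache_dtype: str = "fp8",
                    gpu_memory_utilization: float = 0.90,
                    max_model_len: int = 131072) -> list:
    """Build the argument list for a `vllm serve` launch."""
    return [
        "vllm", "serve", model,
        # shard the model across every card in the node
        "--tensor-parallel-size", str(tensor_parallel_size),
        # FP8 cache roughly halves per-token cache memory versus BF16
        "--kv-cache-dtype", kv_cache_dtype,
        # leave headroom below 1.0 for allocation spikes
        "--gpu-memory-utilization", str(gpu_memory_utilization),
        # cap context to what the workload actually uses
        "--max-model-len", str(max_model_len),
    ]


# HYPOTHETICAL model identifier; substitute the real repository or path.
argv = vllm_serve_argv("your-org/GLM-5.2-FP8")
print(shlex.join(argv))
```

Building the command in code rather than a hand-edited shell script keeps the flags under version control next to the pinned engine version, so the config that was load-tested is the config that ships.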

Where to be careful: this week’s specific numbers come from a small set of serving guides all keyed to one release, GLM-5.2. The version floor, the eight-card sizing, and the prompts-per-day break-even are well attributed, but they describe a single model’s deployment as of late June, not a universal constant. The failure-mode detail (memory-utilization ceilings, the gap between PCIe and NVLink bandwidth that can wreck tail latency in tensor-parallel setups, the 30-to-90-second cold start on serverless) is solid operations knowledge, but it predates this week, so treat it as durable background rather than fresh news. None of that changes the method. It just means the numbers get verified against the actual model and the actual hardware instead of copied.

GLM-5.2 footprint by precision (worked example, June 2026 guides)

Precision	Approx. weights	Practical GPU floor
BF16	~1,488 GB	~16x H200
FP8	~744 GB	8x H200
AWQ INT4	~372 GB	4x H200
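The table can be regenerated from the parameter count. A rough sketch with two assumptions of mine: a 15 percent overhead reserve taken as a fraction of total VRAM, and card counts rounded up to a power of two, which is a common practical constraint for tensor parallelism but not something the guides state. KV cache is left out, so these are floors before any traffic arrives.

```python
import math

H200_VRAM_GB = 141
PARAMS_BILLIONS = 744

# Bytes per parameter at each precision in the table above.
BYTES_PER_PARAM = {"BF16": 2.0, "FP8": 1.0, "AWQ INT4": 0.5}


def next_power_of_two(n: int) -> int:
    """Smallest power of two that is >= n (for n >= 1)."""
    return 1 << (n - 1).bit_length()


def gpu_floor(weights_gb: float, per_card_gb: float = H200_VRAM_GB,
              overhead_fraction: float = 0.15) -> int:
    """Minimum card count holding the weights after reserving overhead."""
    required_total_gb = weights_gb / (1 - overhead_fraction)
    cards = math.ceil(required_total_gb / per_card_gb)
    return next_power_of_two(cards)


for precision, bytes_per_param in BYTES_PER_PARAM.items():
    weights = PARAMS_BILLIONS * bytes_per_param
    print(f"{precision:9s} ~{weights:,.0f} GB  {gpu_floor(weights)}x H200")
```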

What I’d tell you over coffee

If a CTO asked me how to run an open-weight model in production right now, I would not start with vLLM flags. I would start with the prompt volume, because most of the time the honest answer is that the hosted endpoint is fine and the sovereignty, control, and compliance reasons to bring it in-house have to carry the decision on their own merits, not on a cost story that does not survive contact with the utilization numbers.

When self-hosting is genuinely the right call, the work is calmer than the breathless version makes it sound and harder than the casual version admits. Size the box for weights plus cache plus overhead, not weights. Pin the serving-engine version. Choose the quant deliberately and test it on the one task that actually matters. Load-test the cache, not just the weights. And run the break-even before signing anything, because the cheapest deployment is still the one a team correctly decided not to run.

That is the whole trick. It is figure-out-able. It just rewards the people who do the sizing before they do the buying.

Sources

Self-Host GLM 5.2: Open Weights & vLLM Guide - Lushbinary, 2026-06-25
Self-Host GLM 5.2 (2026): 8xH200 vLLM Cost vs $30/mo Cloud - ofox.ai, 2026-06-23
Deploy GLM-5.2 on GPU Cloud: Self-Host Z.ai's 744B Coding MoE with 1M Context - Spheron, 2026-06-17
Self-Hosting Open-Weight LLMs: 2026 Deployment Decision Guide - Digital Applied, 2026-05-27
