What Actually Breaks When You Self-Host an LLM in Production

2026-06-20

Self-hosting an open-weight model in production rarely breaks on the model. It breaks on the operations around it: KV-cache sizing, CUDA out-of-memory, cold-start latency, and the volume line below which the API was cheaper all along.

TLDR

When a self-hosted LLM falls over in production, the model is almost never the problem. The operations around it are: KV-cache sizing that triggers CUDA out-of-memory, a 30 to 90 second cold start nobody planned for, and a quantization choice that ties a team to one GPU generation. Size the memory and the headroom before sizing the model, and know the break-even volume before signing the node.

A self-hosting guide I read this week, published June 17, has one line in it that is worth the whole article. The advice for running a big open-weight model on an 8x H200 node was not a clever flag or a magic batch size. It was this: “Leave 20% VRAM headroom above the weights-plus-KV total. CUDA fragmentation will eat the rest.” That sentence is the entire job. The model fits, the demo works, and then production walks in with traffic the demo never had.

I keep watching teams do the same thing. They size the GPU for the weights, see the weights fit, and call it sized. Three weeks later the box is throwing out-of-memory errors at 11pm and nobody can say why, because the weights still fit. They always did. The thing that grew was everything around them.

The standard GLM-5.2 self-host setup and its KV cache

The fresh example this week is GLM-5.2, the open-weight 744B-parameter Mixture-of-Experts model from Z.ai that landed mid-June with MIT-licensed weights. A wave of self-hosting writeups followed within days, and they are a clean window into what a real production setup looks like right now.

The shape is consistent across all three guides I read. A practical FP8 deployment of this model is a single multi-GPU node, because the FP8 weights alone run about 744GB. Eight H200 cards at 141GB each cover that with room left for the cache; eight 80GB H100s total 640GB and do not, which is why one guide put the floor at 10x H100 80GB for FP8 and 20x H100 for the full BF16 version. Nobody is running this one on a spare workstation, and that is the first honest signal: the headline open model is rarely the one a single box can serve.
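The card counts fall out of simple arithmetic. Here is a minimal sketch, assuming one byte per parameter at FP8, two at BF16, decimal gigabytes, and weights only with no cache or headroom. The function names are mine, not the guides'.

```python
import math

def weights_gb(params_billion: float, bytes_per_param: float) -> float:
    """Weight memory in decimal GB: billions of params times bytes per param."""
    return params_billion * bytes_per_param

def min_cards(total_gb: float, card_gb: float) -> int:
    """Smallest card count whose combined VRAM covers total_gb."""
    return math.ceil(total_gb / card_gb)

fp8 = weights_gb(744, 1)    # FP8: one byte per parameter
bf16 = weights_gb(744, 2)   # BF16: two bytes per parameter

print(fp8, min_cards(fp8, 80))          # 744 GB needs 10 H100 80GB cards
print(bf16, min_cards(bf16, 80))        # 1488 GB needs 19 cards before any margin
print(8 * 80 >= fp8, 8 * 141 >= fp8)    # 8x H100 falls short, 8x H200 covers it
```

The BF16 result lands at 19 cards rather than the guide's 20, presumably because the guide leaves some margin. This sketch counts bare weights only.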

The serving stack is where the operational decisions live. The recommended vLLM configuration in these guides reads like a checklist of things teams only learn the hard way: --tensor-parallel-size 8 to split the model across the cards, --kv-cache-dtype fp8 to shrink the cache, and --enable-chunked-prefill plus prefix caching to keep latency sane under load. SGLang shows up as the alternative, with its RadixAttention giving meaningfully higher throughput when a large system prompt gets reused across requests. These are not exotic settings. They are the difference between a server that holds up and one that does not.
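To make that checklist concrete, here is a rough sketch that assembles a `vllm serve` command from the flags the guides name. Two things are assumptions on my part: the model ID is a hypothetical placeholder, and `--enable-prefix-caching` is vLLM's flag for the prefix caching the guides mention without my quoting its spelling from them. Defaults shift between vLLM releases, so check the flags against the installed version before copying this.

```python
import shlex

def vllm_serve_command(model: str, tensor_parallel: int = 8) -> list[str]:
    """Build the argument list for a vLLM server with the guides' settings."""
    return [
        "vllm", "serve", model,
        "--tensor-parallel-size", str(tensor_parallel),  # split across the cards
        "--kv-cache-dtype", "fp8",                       # shrink the KV cache
        "--enable-chunked-prefill",                      # keep latency sane under load
        "--enable-prefix-caching",                       # reuse shared prompt prefixes
    ]

MODEL_ID = "your-org/glm-5.2-fp8"  # hypothetical placeholder, not a real repo name

# Print the command rather than running it, so the sketch is safe anywhere.
print(shlex.join(vllm_serve_command(MODEL_ID)))
```

Building the command as a list keeps it easy to hand to a process supervisor or to diff in code review, which is where a silently changed tensor-parallel size tends to get caught.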

And then there is the cache. A 1M-token context window sounds like a feature until its memory gets priced. One guide noted that the KV cache for that context at FP8 runs about 80GB even at a batch size of one. That is a whole 80GB H100’s worth of memory, sitting on top of the weights, scaling with every concurrent user and every long document they paste in.

~80GB: KV cache for a 1M-token context at FP8, batch size one, sitting on top of the weights
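That figure can be turned into a rough planning number for real traffic. The sketch below assumes the cache grows linearly with the total number of cached tokens, using the guide's 80GB per million tokens as the rate. That is a simplification: prefix sharing and paged allocation move the real number, and the function and parameter names are mine.

```python
def kv_cache_gb(tokens_per_sequence: int, concurrent_sequences: int,
                gb_per_million_tokens: float = 80.0) -> float:
    """Estimate KV-cache memory, assuming linear growth with cached tokens."""
    total_tokens = tokens_per_sequence * concurrent_sequences
    return total_tokens / 1_000_000 * gb_per_million_tokens

print(kv_cache_gb(1_000_000, 1))            # the guide's case: 80.0
print(round(kv_cache_gb(128_000, 16), 1))   # 16 users at 128K tokens each: 163.8
```

Sixteen users with long documents already cost twice the single-request figure, which is the gap between the demo and production in one line.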

OOM, cold start, and silent failures

The dominant failure mode in self-hosted serving is not a crash from a bad model. It is CUDA out-of-memory, and it is an operations problem wearing a hardware costume. It shows up when the tensor-parallel size is set too low, or when the KV-cache budget was set for the prompts in the demo rather than the prompts in production. The fix is rarely a bigger GPU. It is leaving the headroom the June 17 guide named, and sizing the cache for real concurrency instead of a single happy-path request.

Cold start is the second thing that surprises people. The same guide measured first requests at 30 to 90 seconds while the server warms up, then sub-second for short prompts after that. When an autoscaler spins up a fresh replica during a traffic spike, the users who triggered the spike are the ones who wait a minute and a half. That is a reliability decision disguised as a performance number, and most teams find it in production rather than in planning.
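One way to keep users off a cold replica is to gate it behind a warm-up check before it joins the load balancer. This is a minimal sketch, not anything the guides prescribe: `probe` is a hypothetical callable that should send a short real generation request and return True on success, and the timeout values are placeholders.

```python
import time
from typing import Callable

def wait_until_warm(
    probe: Callable[[], bool],
    timeout_s: float = 180.0,
    interval_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll probe() until it returns True or timeout_s elapses."""
    deadline = clock() + timeout_s
    while True:
        try:
            if probe():
                return True
        except Exception:
            pass  # a replica still loading weights may refuse connections
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

A replica is marked ready only after `wait_until_warm` returns True. The 180-second default sits above the 30 to 90 seconds the guide measured, and passing `sleep` and `clock` in as parameters makes the gate testable without real waiting.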

The third trap is quieter, and it is the one that scares me most. A production study published in mid-June looked at an LLM agent system over eight weeks and catalogued 22 incidents with full postmortems. The finding that stuck with me was about how the failures got caught.

"Roughly 70% of silent failures were ultimately caught by human user-view observation of system output, not by unit tests, health checks, or governance audits."

arXiv, When Errors Become Narratives, June 2026

That study sits a few days outside the strict window for this week’s fresh news, so treat its numbers as recent background rather than breaking signal. The system it described ran 4,286 unit tests and 827 governance checks, and the silent failures still slipped past nearly all of them, with the silence lasting anywhere from 13 hours to 60 days. The in-window serving guides describe the same surface from the other side: a model whose KV cache evicts the wrong tokens can slide into degenerate repetition with no exception thrown, no OOM, no alert. The server says it is healthy. The output is broken. Only a human reading it can tell.

Key Insight

The most dangerous self-hosted failures throw no error. The server reports healthy, the metrics look green, and the output is quietly wrong. A human-visible check on real output catches what a test suite structurally cannot.
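Degenerate repetition is the one silent failure a crude script can at least flag for a person to look at. The sketch below is a heuristic of my own, not something from the study or the guides: it measures what share of word 4-grams in an output repeat an earlier one. It will not catch fluent, plausible, wrong text, so it is a triage aid for the human reader, not a replacement.

```python
from collections import Counter

def repeated_ngram_ratio(text: str, n: int = 4) -> float:
    """Fraction of word n-grams that repeat an n-gram seen earlier in the text."""
    words = text.split()
    if len(words) < n:
        return 0.0
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    counts = Counter(ngrams)
    repeats = sum(c - 1 for c in counts.values())  # every occurrence after the first
    return repeats / len(ngrams)

looping = "the cache the cache the cache the cache the cache"
normal = "size the cache for real concurrency before buying any hardware"
print(round(repeated_ngram_ratio(looping), 2), repeated_ngram_ratio(normal))
```

Any cutoff for flagging an output would be a judgment call to tune on real traffic. Legitimate outputs such as tables or code repeat themselves too.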

The weights are the demo, the operations are the job

Step back from GLM-5.2 and the pattern is the same for any open-weight model a team self-hosts. The weights are the part the vendor demo shows. The operations are the part the team inherits.

Quantization is a good example of how this compounds. FP8 is the practical default for serving these large models now, and it cuts the memory roughly in half relative to BF16. But native FP8 math needs Hopper-class GPUs, the H100 or H200; older Ampere cards have no FP8 tensor cores, so teams on them fall back to a different quantization, typically a GGUF build. So a quantization choice is also a hardware-generation choice, and the two are welded together in a way that does not show up until procurement. One guide also flagged that FP8 “can affect accuracy on tasks at the edge of the model’s capability,” which is the polite way of saying the quant that looks fine on a benchmark might wobble on the one task that actually matters.
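The hardware tie can be checked before procurement rather than after. A small sketch, assuming NVIDIA's compute-capability numbering, where FP8 tensor cores first appear at 8.9 (Ada) and 9.0 (Hopper) and Ampere sits at 8.0 and 8.6. The function name is mine, and some serving stacks can still load FP8 weights on older cards through slower fallback paths, so treat a False here as "no native FP8", not "will not run".

```python
def has_native_fp8(compute_capability: tuple[int, int]) -> bool:
    """True when the GPU generation ships FP8 tensor cores (capability >= 8.9)."""
    return compute_capability >= (8, 9)

print(has_native_fp8((8, 0)))  # A100, Ampere: False
print(has_native_fp8((9, 0)))  # H100 or H200, Hopper: True
```

On a machine with PyTorch installed, `torch.cuda.get_device_capability()` returns a tuple in this form for the current device.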

The market is already voting with its downloads. The same guide noted the FP8 build of this model had roughly 93,900 downloads against around 11,900 for the full-precision BF16 version. Teams are reaching for the smaller footprint by an eight-to-one margin, because the smaller footprint is the one that fits on hardware they can get.

What self-hosting actually costs a team in operations

The part the demo shows	The part a team inherits
Weights fit in VRAM	KV cache + headroom blow the budget under load
Fast response in testing	30 to 90 second cold start on a fresh replica
Model passes the benchmark	FP8 dents accuracy at the edge of capability
Tests and health checks pass	Silent failures slip past, caught only by a human

Then there is the cost line, which is the part a founder will ask about first. Self-hosting beats the hosted API on unit cost only above a real and surprisingly high volume. One guide put the break-even at roughly 2.4 billion output tokens per month, comparing about $23 per million tokens self-hosted against $4.40 per million on the vendor’s API. Another framed it in plainer terms: a setup needs somewhere north of 3,000 prompts a day and a 30% duty cycle on the node just to beat an $80-a-month hosted plan, and closer to 10,000 prompts a day to justify owning the hardware. An 8x H200 node runs $30 to $50 an hour, which is roughly $22,000 to $36,500 a month kept lit around the clock, or about $29,000 at the $40 midpoint.

That is the math nobody puts in the “we’ll self-host to save money” deck. The GPU costs the same whether it serves zero tokens or four billion, so below the break-even line, a rented API is simply cheaper and the self-hosted box is a monument to good intentions running at 9% utilization.
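The break-even is easy to compute for a team's own numbers. Here is a minimal sketch that assumes the node bill is the only self-hosting cost (no staff, storage, or egress) and a 730-hour month. The function names are mine, and the guides' own figures rest on inputs not reproduced here, so the results will not line up exactly with theirs.

```python
HOURS_PER_MONTH = 730  # 24 * 365 / 12

def node_monthly_usd(usd_per_hour: float, hours: float = HOURS_PER_MONTH) -> float:
    """Monthly bill for a node that stays lit for the given hours."""
    return usd_per_hour * hours

def breakeven_million_tokens(node_monthly: float, api_usd_per_million: float) -> float:
    """Monthly volume, in millions of tokens, where the API bill equals the node bill."""
    return node_monthly / api_usd_per_million

def self_hosted_usd_per_million(node_monthly: float, million_tokens: float) -> float:
    """Effective self-hosted unit cost at a given monthly volume."""
    return node_monthly / million_tokens

monthly = node_monthly_usd(40)                          # midpoint of $30 to $50 an hour
print(monthly)                                          # 29200
print(round(breakeven_million_tokens(monthly, 4.40)))   # 6636 million tokens a month
print(round(self_hosted_usd_per_million(monthly, 500), 2))  # 58.4 at 500M tokens a month
```

The last line is the point of the exercise: the node price is fixed, so the unit cost is entirely a function of how much traffic actually arrives.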

Model selection is the easy 20 percent

For a team weighing a self-hosted deployment, the model selection is the easy 20% of the decision. The operations are the other 80%, and they are knowable in advance with three moves made before any hardware gets bought.

Size the memory, not just the weights. Add up the weights, the KV cache at real concurrency and context length, and then the 20% headroom that June 17 guide insisted on, and size to that total. The cache and the headroom are where the OOM lives, and they are also where the planning usually stops.
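As a worked check, here is a small sketch of that sum. It reads the guide's advice as 20% on top of the weights-plus-KV total, treats the node's VRAM as one aggregate pool, and uses 141GB per H200. The per-card split under tensor parallelism is not modeled, and the function names are mine.

```python
def required_vram_gb(weights_gb: float, kv_cache_gb: float,
                     headroom: float = 0.20) -> float:
    """Weights plus KV cache, padded by a headroom fraction for fragmentation."""
    return (weights_gb + kv_cache_gb) * (1.0 + headroom)

def fits(node_gb: float, weights_gb: float, kv_cache_gb: float,
         headroom: float = 0.20) -> bool:
    """True when the padded total fits in the node's aggregate VRAM."""
    return required_vram_gb(weights_gb, kv_cache_gb, headroom) <= node_gb

NODE_GB = 8 * 141  # 8x H200

# One 1M-token request: 80GB of cache on top of 744GB of FP8 weights.
print(round(required_vram_gb(744, 80), 1), fits(NODE_GB, 744, 80))    # 988.8 True
# Three such requests at once, assuming the cache scales linearly.
print(round(required_vram_gb(744, 240), 1), fits(NODE_GB, 744, 240))  # 1180.8 False
```

The same node that serves the demo request comfortably runs out at three concurrent long-context requests, with the weights unchanged.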

Keep a human-visible check on real output. A test suite is necessary and it will still miss the silent failures, because the dangerous ones produce fluent, plausible, wrong text that no health check is built to notice. A cheap sample of actual production output, read by an actual person, catches what eight hundred governance checks did not.
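The sampling itself can be a few lines. This sketch picks a stable fraction of requests for human review by hashing a request ID. The 1% rate and the idea that each request carries a string ID are my assumptions, not details from the study.

```python
import hashlib

def selected_for_review(request_id: str, sample_rate: float = 0.01) -> bool:
    """Deterministically select roughly sample_rate of requests by hashing the ID."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")   # uniform 64-bit integer
    return bucket < int(sample_rate * 2**64)     # integer compare avoids float edge cases

picked = [rid for rid in (f"req-{i}" for i in range(10_000))
          if selected_for_review(rid, sample_rate=0.1)]
print(len(picked))  # close to 1,000 of the 10,000 IDs
```

Hashing rather than calling a random number generator means the same request is always in or out of the sample, so a reviewer can trace a flagged output back to its logs on any replica.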

And do the volume math honestly before signing the node. Find the break-even, in tokens or prompts a day, and look at where the real traffic sits relative to it. Below the line, the API is not a failure to self-host. It is the correct answer, and keeping the heavy steady workloads local while the rest stays on the API is not a compromise. It is just good sizing.

Self-hosting an LLM is an operations discipline a team staffs for, not a download it runs. The teams that do it well are not the ones with the biggest model. They are the ones who sized the cache before the GPU, kept a human in the loop on output, and knew their break-even before anyone asked. That is the whole trick, and it is far more figure-out-able than the 11pm out-of-memory page makes it feel.

Sources

Deploy GLM-5.2 on GPU Cloud: Self-Host Z.ai's 744B Coding MoE with 1M Context (2026 Guide) - Spheron Blog, 2026-06-17
Self-Host GLM 5.2 in 2026: Hardware, vLLM Setup, and Cost vs Cloud - ofox.ai, 2026-06-17
Running GLM-5.2 at Home: SGLang, vLLM, Transformers, and KTransformers Setup Guide - Groundy, 2026-06-18
When Errors Become Narratives: A Longitudinal Taxonomy of Silent Failures in a Production LLM Agent Runtime - arXiv (Wei Wu), 2026-06-12
