The LLM quantization myth: a good benchmark does not mean a safe quant for your task

[Illustration: a language model file shrinking from a large full-precision block to a small 4-bit block, with a magnifying glass over a single failing test case while the overall benchmark gauge still reads green.]

A quant that scores well on the leaderboard can still break the one task you ship. Here is why averages hide the failure, and the small eval that catches it before your users do.

TLDR

The benchmark almost everyone trusts is run on the full-precision model, while the thing that ships is a quantized version nobody scored on the real workload. At 4-bit and above the average quality loss really is small, which is exactly why the myth survives. But averages hide per-item failures, so the only honest check is a small eval on the actual task against the specific quant headed for production.

This week handed us a near-perfect demonstration, and almost nobody framed it that way.

When GLM-5.2’s open weights and official benchmarks landed on June 18, two things arrived together: a strong scorecard for the full-precision model, and a shelf of community quantizations ranging from a 217GB one-bit build up to an 801GB Q8_0. Totalum counted 24 community quant variants on day one. And not one of them shipped a per-quantization accuracy number. The model card, as AI Weekly noted, gives you sizes and nothing about what each quant does to quality. So here is the freshest, best-documented open model of the month, and the gap between “the benchmark” and “the build I will run” is wide open and completely unmeasured.

That gap is where the myth lives.


The myth that a good benchmark means a safe quant

The myth is simple, and most engineering leaders I talk to half-believe it even when they know better: if a model scores well on the benchmark, the quantized version is safe to ship.

It usually sounds more reasonable than that. “The 4-bit only drops a point or two.” “Quantization is basically free now.” “We checked the leaderboard, it’s fine.” The shared assumption underneath all of them is that the benchmark score travels with the model into the deployment. It does not. The score belongs to a specific build, usually full precision, and the build that lands in production is a smaller, cheaper, faster cousin that was never tested on the task that actually matters.

Why the average accuracy barely moves at 4-bit

The myth survives because, at the precisions most teams actually use, the average really does barely move.

The honest version of the numbers is genuinely reassuring. A study from Digital Applied earlier this year measured the format hierarchy across six open-weight 70B-class models, and the deltas are small. As Digital Applied put it: “Across six 2026 open-weight 70B-class models, FP8 lands within 0.4 points of FP16 on MMLU-Pro and HumanEval+, INT8 within 0.7 points, AWQ-4 within 1.6 points, and GPTQ-4 within 1.9 points.” Those are background figures from April, not this week, but they match what most infra leads have seen with their own eyes. Drop a big model from 16-bit to 4-bit and the aggregate score wobbles by a point or two. Weight memory falls to roughly a quarter of its 16-bit size. The throughput climbs. The leaderboard says you are fine.

Average accuracy retained vs full precision (Digital Applied, April 2026, background)

Format    Gap from FP16 (MMLU-Pro / HumanEval+)
FP8       within 0.4 pts
INT8      within 0.7 pts
AWQ-4     within 1.6 pts
GPTQ-4    within 1.9 pts

So the myth is not stupid. It is built on a real, repeatable observation. The problem is that “the average barely moved” and “my task still works” are two different claims, and people treat them as one.


What is quantization in LLM terms, and where the average lies

Quick grounding, because the word does a lot of hiding. Quantization in an LLM context just means storing the model’s weights (and sometimes its activations and KV cache) at lower numerical precision. The usual full-precision baseline for LLM weights is 16-bit (FP16 or BF16). Common targets are 8-bit and 4-bit, using methods and formats such as GGUF, AWQ, GPTQ, and FP8. Fewer bits per weight means a smaller file, less VRAM, and usually faster inference, because less data has to move through memory. The whole question of quantization quality is what those fewer bits cost you.
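To make "fewer bits per weight" concrete, here is a minimal sketch of the simplest possible scheme: symmetric round-to-nearest quantization with a single scale, in plain Python. It is an illustration only. Real methods such as GPTQ and AWQ use per-group scales and calibration data, and the function names and toy weights below are assumptions made up for this example, not part of any library.

```python
def quantize_dequantize(weights, bits):
    """Round each weight onto a signed `bits`-bit integer grid, then map it back.

    One scale for the whole list (per-tensor, symmetric). Production
    quantizers are considerably more careful than this.
    """
    qmax = 2 ** (bits - 1) - 1              # 127 for 8-bit, 7 for 4-bit
    largest = max(abs(w) for w in weights)
    if largest == 0:
        return list(weights)
    scale = largest / qmax                  # step size of the integer grid
    restored = []
    for w in weights:
        q = max(-qmax, min(qmax, round(w / scale)))  # the integer that gets stored
        restored.append(q * scale)                   # the value the model then uses
    return restored


def max_error(weights, bits):
    """Largest absolute gap between original and restored weights."""
    restored = quantize_dequantize(weights, bits)
    return max(abs(a - b) for a, b in zip(weights, restored))


def weight_gigabytes(n_params, bits):
    """Approximate weight storage, ignoring scales and other metadata."""
    return n_params * bits / 8 / 1e9


# Made-up weights, purely for illustration.
toy_weights = [0.82, -0.31, 0.05, -0.77, 0.12, 0.40, -0.02, 0.66]

for bits in (8, 4):
    print(f"{bits}-bit: max error {max_error(toy_weights, bits):.4f}, "
          f"70B weights ~{weight_gigabytes(70e9, bits):.0f} GB")
```

The rounding error per weight is bounded by half the grid step, and the step gets much coarser as bits drop. That error is small for most weights and most outputs, which is why the average holds, but nothing in the scheme guarantees it is small for the particular outputs a product depends on.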

And the answer is not “a flat percentage off everything.” It is “a small amount off most things, and a surprising amount off a few specific things.”

That distinction is the part the average erases. A May preprint on compressed models put real numbers to it: perplexity increases by less than 0.5% at 8-bit and under 3% at 4-bit across the models tested, yet 2.5 to 5.6% of individual items already develop new biases at 4-bit, and the same body of work found 5 to 16% of answers can flip under quantization. Sit with that for a second. The aggregate metric stays green. Meanwhile up to one answer in six on certain item sets changes. If the answers that flip happen to be the tool-calls, the structured outputs, or the multi-step reasoning a product depends on, that “1.9 point drop” is a production incident wearing a good scorecard.

2.5-5.6% of items developing new biases at 4-bit while overall perplexity moved under 3% (arXiv preprint 2605.15208, May 2026, background)

Here is the verbatim finding, because it is the whole argument in one sentence.

"perplexity increases by less than 0.5% at 8-bit and under 3% at 4-bit across all three models, yet 2.5-5.6% of items already develop new biases at 4-bit"

arXiv preprint 2605.15208, May 2026 (background, corroborating)
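The difference between "the average barely moved" and "items flipped" is easy to see with per-item results side by side. The sketch below uses invented pass/fail data (100 items, not taken from any of the cited studies) and a hypothetical helper, `flip_report`, to show how a 2-point drop in accuracy can sit on top of 14% of items changing outcome.

```python
def flip_report(reference_correct, quant_correct):
    """Compare per-item pass/fail lists from the reference model and the quant."""
    if len(reference_correct) != len(quant_correct):
        raise ValueError("both runs must cover the same items")
    n = len(reference_correct)
    broke = [i for i in range(n) if reference_correct[i] and not quant_correct[i]]
    fixed = [i for i in range(n) if not reference_correct[i] and quant_correct[i]]
    return {
        "reference_accuracy": sum(reference_correct) / n,
        "quant_accuracy": sum(quant_correct) / n,
        "flip_rate": (len(broke) + len(fixed)) / n,
        "newly_broken_items": broke,
    }


# Toy data, invented for illustration: the reference gets items 0..79 right.
reference = [i < 80 for i in range(100)]
quant = list(reference)
for i in range(0, 8):       # 8 items the quant newly gets wrong
    quant[i] = False
for i in range(80, 86):     # 6 items it newly gets right
    quant[i] = True

report = flip_report(reference, quant)
print(f"accuracy: {report['reference_accuracy']:.2f} -> {report['quant_accuracy']:.2f}")
print(f"items that changed outcome: {report['flip_rate']:.0%}")
print(f"newly broken: {report['newly_broken_items']}")
```

The list of newly broken items is the part worth reading. The headline number nets gains against losses, so it cannot tell you whether the losses landed on the tool-calls or structured outputs you care about.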

There is a second pattern worth naming, because it tells a team where to look. The damage is not uniform across tasks. Reasoning, math, and long-chain work degrade first. Summarization, classification, and short code completion hold up longest. So a quant can ace a coding benchmark and quietly lose the plot on a five-step agent loop, because the benchmark measured the resilient task and the product runs the fragile one.

And the cliff is real once the quant gets aggressive. The only attributable per-quant numbers anyone published for GLM-5.2 came from the Unsloth dynamic-quant benchmark, and they show the shape clearly: the four-bit and five-bit dynamic builds came out generally lossless, the dynamic two-bit landed around 82% accuracy at 84% smaller, and the dynamic one-bit around 76.2% at 86% smaller. Four-bit, fine. One-bit, a quarter of the accuracy is gone in exchange for disk space. Same model, very different quant, and nobody could have flagged that from the official benchmark.

Key Insight

A benchmark score is an average over a test set the vendor chose. A product is a specific task the vendor never saw. Quantization quality is the distance between those two things, and only the team running the workload can measure it.


The better question: good enough for your one task?

The better question is not “is this quant good?” It is “is this quant good enough at the one task we are about to depend on?”

That is the whole shift. Quantization is not free quality and it is not a trap. It is a per-task question with a small, knowable answer. The reason this week’s GLM-5.2 release is the perfect teacher is that it removes the temptation to outsource the judgment. There is no per-quant number to lean on. As Totalum wrote, teams “will need to benchmark the quantized variants against their own tasks rather than assuming parity with the full-precision scores.” That is not a GLM-5.2 quirk. That is the normal state of every quant that ever reaches a deployment. GLM-5.2 just made it impossible to pretend otherwise.

The benchmark belongs to the full-precision model. The thing that ships is a quant nobody scored on the task that matters.

The good news is that closing the gap is cheap. No research lab, no leaderboard submission. Just 30 to 50 real examples from the actual workload, the exact prompts and the exact expected behavior, run against the specific quant headed for production and against the full-precision model as a reference. The goal is not a benchmark number. The goal is to catch the items that flip. If the tool-calls still parse, the reasoning still lands, and the structured output still validates, ship it. If 1 in 10 quietly breaks, the team caught it in an afternoon instead of in a support queue.
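As a rough sketch of what that afternoon's work might look like, here is a tiny harness that runs the same examples through a reference model and a quant and lists the ones that regress. Everything named here is an assumption for illustration: `fake_reference` and `fake_quant` are stand-ins for calls to your own inference endpoints, and `valid_tool_call` is one possible check (does the output parse as a JSON tool-call with the expected keys). Swap in whatever "still works" means for your task.

```python
import json


def valid_tool_call(output, required_keys=("name", "arguments")):
    """Pass if the output is a JSON object containing the required keys."""
    try:
        parsed = json.loads(output)
    except (json.JSONDecodeError, TypeError):
        return False
    return isinstance(parsed, dict) and all(k in parsed for k in required_keys)


def run_eval(examples, generate, check):
    """Return one pass/fail boolean per example for a given model callable."""
    return [check(generate(example["prompt"])) for example in examples]


def regressions(examples, reference_generate, quant_generate, check):
    """IDs of examples the reference passes but the quant fails."""
    ref = run_eval(examples, reference_generate, check)
    quant = run_eval(examples, quant_generate, check)
    return [ex["id"] for ex, r, q in zip(examples, ref, quant) if r and not q]


# Hypothetical stand-ins: replace these with calls to your own models.
def fake_reference(prompt):
    return json.dumps({"name": "lookup_order", "arguments": {"query": prompt}})


def fake_quant(prompt):
    # Simulates a quant that drops the closing brace on longer prompts.
    out = fake_reference(prompt)
    return out[:-1] if len(prompt) > 20 else out


examples = [
    {"id": "short", "prompt": "order 1182 status"},
    {"id": "long", "prompt": "find the refund for the order placed last Tuesday"},
]

print(regressions(examples, fake_reference, fake_quant, valid_tool_call))
```

The design choice that matters is comparing per example against the reference rather than computing one score. A quant that fails items the full-precision model also fails has told you nothing about quantization; a quant that fails items the reference passes has.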

Run 30 to 50 real examples before the next deploy

This week the practical move is small and worth doing before the next deploy. Pick the task a team would be most embarrassed to break. Write down 30 to 50 examples of it. Run them against full precision once to set a reference, then against whatever 4-bit or FP8 build is tempting. Keep that little eval in the repo, because the next open-weight model is days away, and at this cadence it pays to be able to point the eval at a new quant with no extra setup.
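A quick back-of-the-envelope check suggests why 30 to 50 examples is a workable size. The sketch below assumes, as a simplification, that each example flips independently at a fixed rate, and borrows the two ends of the 5 to 16% flip range cited earlier. The function name is made up for this example.

```python
def chance_of_catching(flip_rate, n_examples):
    """Probability that n independent examples include at least one flip."""
    return 1 - (1 - flip_rate) ** n_examples


for n in (30, 50):
    for rate in (0.05, 0.16):
        print(f"{n} examples, {rate:.0%} flip rate: "
              f"{chance_of_catching(rate, n):.1%} chance of seeing at least one flip")
```

Under that assumption, 30 examples surface a 5% flip rate roughly 79% of the time and 50 examples roughly 92% of the time; at a 16% rate, either size catches it almost always. Real failures cluster on particular kinds of input rather than landing independently, so treat this as a sanity check on sample size, not a guarantee.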

The teams that get burned by quantization are not the ones who quantized too hard. They are the ones who trusted a number that belonged to a different build. The average will keep saying everything is fine. Fifty real examples will tell the truth, and they cost one afternoon. That is the cheapest insurance in this whole stack, and it is the one most people skip because the benchmark already made them feel safe.

Sources

  1. GLM 5.2 Benchmarks Published: 62.1 SWE-bench Pro, MIT-Licensed Weights on HuggingFace - Totalum, 2026-06-18
  2. Unsloth Quantizes GLM-5.2's 1.51TB to 217GB for Local Inference - AI Weekly, 2026-06-18
  3. Quantization Tradeoffs: 4-bit vs 8-bit vs FP8 Performance Data - Digital Applied, 2026-04-24
  4. Quantization Undoes Alignment: Bias Emergence in Compressed LLMs Across Models and Precision Levels - arXiv (preprint 2605.15208), 2026-05-15
