GGUF vs AWQ vs GPTQ: which quantization format should you actually deploy?

2026-06-27

Image: a side-by-side comparison of four labeled memory blocks for FP16, FP8, AWQ-4, and GPTQ-4 precision, each smaller than the last, next to a single GPU.

The quantization format for a self-hosted open-weight model is a hardware-and-task decision, not a leaderboard one. Here is how to read the GGUF, AWQ, GPTQ, and FP8 tradeoffs without getting fooled by an average.

A team I talked to recently spent two days arguing about which quantization format to deploy, picked the one with the best benchmark number, shipped it, and then watched their agent quietly stop calling tools correctly. Nobody had broken anything. The format was fine. The benchmark was real. It just measured the wrong thing for their workload, and the gap only showed up in production.

TLDR

The quantization format for a self-hosted open-weight model is mostly a hardware-and-task decision, not a quality ranking. GGUF, AWQ, GPTQ, and FP8 land within a point or two of each other on average accuracy, but the differences cluster in reasoning and tool-calling, and the right format depends on which GPU is doing the serving. Pick by the hardware target, then settle the choice with a small eval on the one task that actually matters.

I went looking for a fresh quantization-format release this week and came up empty. No new open-weight model dropped with a format that changes the math, no serving engine shipped a quantization bombshell, and the model-release trackers said as much in plain language: no open-source releases this week. That sounds like bad news for an article, but it is actually the point. The format decision does not move week to week. It is one of the more stable calls in the whole self-hosting stack, which means a team can reason it out once and reuse the answer for a long time.

What “GGUF vs AWQ vs GPTQ” is really asking

When someone asks which format to use, they usually think they are asking “which one is best.” They are actually asking three quieter questions at once: which hardware is doing the serving, how much quality the workload can afford to lose, and how much throughput it needs. The formats sort themselves cleanly along those lines once they stop being treated as competitors for a single crown.

GGUF is the format that came out of llama.cpp, and its whole personality is portability. It runs on CPU, on consumer GPUs, on Apple Silicon, and on mixed CPU-plus-GPU setups where part of the model lives in system RAM. If the box serving the model is a laptop, a Mac, or a single consumer card, GGUF is usually the answer before the conversation even starts. The common Q4_K_M variant keeps roughly 92 percent of full-precision quality while cutting the memory footprint by about three quarters, which is why it is the default a lot of people never need to think past.

AWQ and GPTQ are the datacenter-GPU formats. Both are 4-bit weight quantization aimed at NVIDIA-class hardware, both cut a 70B model's weights from around 140GB at FP16 down to roughly 35 to 40GB, with the KV cache on top of that, and both shine when paired with an optimized kernel. AWQ tends to edge out GPTQ slightly on quality because it uses activation statistics to find and protect the small share of weights that matter most. GPTQ has been around longer and has the broadest tooling support. In practice the quality gap between a good AWQ-4 and a good GPTQ-4 is small enough that the kernel and tooling already in place matter more than the format name.
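The memory figures above are plain arithmetic, and it is worth seeing once. This is a minimal sketch covering weights only, with no KV cache and no activations. The 4.5 bits-per-weight row is my own assumption, standing in for the scales and zero points a 4-bit scheme stores alongside the weights; the real overhead depends on the group size and the format.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight-only memory in GB, using 1 GB = 1e9 bytes."""
    return params_billions * bits_per_weight / 8


# (label, bits per weight). The 4.5 figure is an assumed effective width
# that includes quantization metadata, not a measured value.
ROWS = [
    ("FP16", 16),
    ("8-bit", 8),
    ("4-bit, ideal", 4),
    ("4-bit, assumed overhead", 4.5),
]

for label, bits in ROWS:
    gb = weight_memory_gb(70, bits)
    saved = 1 - bits / 16  # fraction saved relative to FP16
    print(f"{label:26s} {gb:6.1f} GB  ({saved:.0%} smaller than FP16)")
```

A 70B model lands at 140 GB for FP16 and between 35 and about 39 GB for the two 4-bit rows, which is where the “35 to 40GB” range comes from.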

Then there is FP8, which is less a quantization format in the old sense and more the production default of 2026 on Hopper and newer GPUs. It halves memory versus FP16, lifts throughput meaningfully, and does it with almost no quality cost, which is exactly why so many teams reach for it first when the hardware supports it.

Accuracy Gap vs FP16 (six 70B-class open-weight models, background figures, April 2026)

Format	Gap on MMLU-Pro / HumanEval+
FP8	within 0.4 pts
INT8	within 0.7 pts
AWQ-4	within 1.6 pts
GPTQ-4	within 1.9 pts

Why the average accuracy hides the decision

Here is the trap, and it is the same one the team I mentioned fell into. On paper the formats are remarkably close. A benchmark published by Digital Applied in April put real numbers on it across six open-weight 70B-class models.

"Across six 2026 open-weight 70B-class models - Llama 4 70B, Qwen 3 72B, DeepSeek V4-Flash, Mistral Large 2, Command-R+, and Yi 2 - FP8 lands within 0.4 points of FP16 on MMLU-Pro and HumanEval+, INT8 within 0.7 points, AWQ-4 within 1.6 points, and GPTQ-4 within 1.9 points."

Digital Applied, April 2026 (background figures, not a fresh in-window result)

Read that quickly and the conclusion is that the format barely matters. Less than two points of difference at 4-bit, and FP8 is basically free. And on average, that is true. But averages are where models go to hide their failures. The same writeup notes that AWQ-4 buys a 3.1x throughput lift at the cost of a 1.4 to 1.8 point quality regression, and that regression is not spread evenly across everything the model does. The first capabilities to wobble under aggressive quantization are reasoning and code generation, the exact tasks where a two-point average drop can mean an eight-point drop on the narrow thing the product actually depends on.

That is why the team’s tool-calling broke while their benchmark stayed green. MMLU-Pro does not test multi-step function calling. Their product lived or died on it. The format did not fail them. The evaluation did, because it measured the model’s general knowledge instead of its one load-bearing skill.
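The arithmetic behind that failure is easy to sketch. The per-category scores below are hypothetical, invented only to show how a two-point average drop can sit entirely inside one capability. They are not taken from the Digital Applied writeup or from any real model.

```python
# Hypothetical scores, made up for illustration only.
fp16_scores = {"knowledge": 80.0, "math": 70.0, "summarization": 75.0, "tool_calling": 85.0}
quant_scores = {"knowledge": 80.0, "math": 70.0, "summarization": 75.0, "tool_calling": 77.0}


def mean(scores: dict) -> float:
    """Unweighted average across categories."""
    return sum(scores.values()) / len(scores)


# The headline number a leaderboard would report.
average_drop = mean(fp16_scores) - mean(quant_scores)

# The breakdown a leaderboard average hides.
per_task_drop = {task: fp16_scores[task] - quant_scores[task] for task in fp16_scores}
worst_task = max(per_task_drop, key=per_task_drop.get)

print(f"average drop: {average_drop:.1f} points")
print(f"worst task:   {worst_task} ({per_task_drop[worst_task]:.1f} points)")
```

With these made-up numbers the average moves two points while tool calling alone moves eight, which is the shape of gap a general-knowledge benchmark will not surface.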

Key Insight

A two-point average accuracy gap between formats is almost meaningless until you know where those two points come from. For many production workloads they come out of reasoning and tool-calling, which are often the capabilities the product depends on most.

Where the speed actually comes from

If quality is roughly a wash, speed is where the formats genuinely separate, and almost all of it comes down to the kernel rather than the format label. This is the part that surprises people.

A raw GGUF Q4_K_M served through vLLM on an H200 has been clocked at around 93 tokens per second, which looks slow. The same class of 4-bit weights served through the Marlin kernel as GPTQ hits roughly 712 tokens per second on the same hardware, and Marlin-AWQ lands around 741. That is not the format being roughly eight times better. That is an optimized kernel built for batched datacenter serving versus a format designed first for a single user on a laptop. FP8 on an H100, for its part, runs about 33 percent faster on output tokens than FP16 with near-zero quality loss, which is why it has become the comfortable default for teams whose GPUs support it.
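For anyone who wants the ratios spelled out, here is a small sketch using only the three H200 throughput figures quoted above. The dictionary labels are mine.

```python
# Tokens per second on an H200, as quoted in the paragraph above.
throughput_tps = {
    "GGUF Q4_K_M via vLLM": 93,
    "GPTQ via Marlin": 712,
    "AWQ via Marlin": 741,
}

baseline = throughput_tps["GGUF Q4_K_M via vLLM"]

# Speedup of each setup relative to the raw GGUF run.
speedup = {name: tps / baseline for name, tps in throughput_tps.items()}

for name, ratio in speedup.items():
    print(f"{name:24s} {ratio:4.1f}x")
```

Both Marlin runs come out a little under eight times the GGUF figure, and they sit within a few percent of each other. The gap tracks the kernel, not the format name.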

The lesson hiding in those numbers is that “GGUF is slow” is a category error. GGUF is not slow. It is built for a different deployment shape. For one user on a Mac, GGUF is fast enough and nothing else installs as easily. For hundreds of concurrent requests batched on H100s, the right choice is AWQ or GPTQ with Marlin, or FP8, and GGUF would never have entered the conversation. The throughput chasm between them is mostly a sign of comparing tools built for different jobs.

There is one more quiet shift worth naming. More labs now ship quantization-aware checkpoints directly, where the low-bit version is the intended deployment rather than a degraded afterthought. Google’s Gemma 4 QAT checkpoints, released earlier this month, cut memory by roughly 72 percent while staying close to the original quality. When a model ships a native low-bit version, that is almost always the best starting point, because the quantization was done with the training, not bolted on afterward.

How the format choice usually resolves

After watching enough of these decisions, the pattern is almost boringly consistent, and that is a good thing. It means the choice is reliable.

Start with the hardware, because it eliminates most of the options for free. Serving on CPU, Apple Silicon, a single consumer card, or a mixed RAM-and-VRAM box points to GGUF, full stop. Serving on datacenter NVIDIA GPUs with real concurrency points to FP8 when the hardware is Hopper or newer, or to AWQ and GPTQ with the Marlin kernel when 4-bit is needed to fit a tighter VRAM budget. The hardware target does about 70 percent of the deciding before quality ever enters the conversation.

Then check whether the model ships a native quantized or QAT checkpoint. If it does, use it. The maintainers know their weights better than a generic post-training quantizer does, and they have usually published which variant they consider the real one.
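The first two steps are mechanical enough to write down as code. This is a sketch of the heuristic as I have described it, not a rule any serving engine enforces. The `Deployment` fields, the hardware labels, and the function name are all hypothetical names I made up for the example.

```python
from dataclasses import dataclass

# Hardware labels are hypothetical shorthand for the cases in the text.
PORTABLE_HARDWARE = {"cpu", "apple_silicon", "consumer_gpu", "mixed_ram_vram"}


@dataclass
class Deployment:
    hardware: str                      # one of PORTABLE_HARDWARE or "datacenter_nvidia"
    hopper_or_newer: bool = False      # does the GPU support FP8 natively?
    needs_4bit_to_fit: bool = False    # is the VRAM budget too tight for FP8?
    has_native_low_bit_checkpoint: bool = False


def shortlist_formats(d: Deployment) -> list[str]:
    """Return candidate formats in the order I would try them."""
    if d.hardware in PORTABLE_HARDWARE:
        candidates = ["GGUF"]
    elif d.hardware == "datacenter_nvidia":
        if d.hopper_or_newer and not d.needs_4bit_to_fit:
            candidates = ["FP8"]
        else:
            candidates = ["AWQ + Marlin", "GPTQ + Marlin"]
    else:
        raise ValueError(f"unknown hardware target: {d.hardware!r}")

    # A checkpoint the maintainers quantized themselves goes to the front.
    if d.has_native_low_bit_checkpoint:
        candidates.insert(0, "native low-bit checkpoint")
    return candidates


print(shortlist_formats(Deployment("apple_silicon")))
print(shortlist_formats(Deployment("datacenter_nvidia", hopper_or_newer=True)))
print(shortlist_formats(Deployment("datacenter_nvidia", needs_4bit_to_fit=True)))
```

Whatever comes back is a shortlist, not a verdict. The last step still has to pick among the survivors.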

Only after those two steps does the average-quality table matter, and by then the choice is between two or three formats that all land within a couple of points of each other. That is the moment to stop reading benchmarks and run a real one. Pull 30 to 50 examples from the task that actually matters, run them through the exact quant headed for production, and look specifically at the load-bearing capability rather than the headline score. A coding agent gets its tool calls tested. A summarizer gets its summaries tested. The whole decision comes down to that small, unglamorous eval, and it takes an afternoon.
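That afternoon eval does not need a framework. Below is a minimal sketch for the tool-calling case. It assumes the model emits each tool call as a JSON object with `name` and `arguments` keys, which is my assumption about the output shape and will differ by stack. `fake_quantized_model` is a stand-in for whatever client actually calls the production endpoint, and the two examples are invented.

```python
import json
from typing import Callable


def tool_call_matches(output: str, expected: dict) -> bool:
    """True if the output parses as JSON and names the expected tool and arguments."""
    try:
        call = json.loads(output)
    except json.JSONDecodeError:
        return False
    return (
        isinstance(call, dict)
        and call.get("name") == expected["name"]
        and call.get("arguments") == expected["arguments"]
    )


def pass_rate(generate: Callable[[str], str], examples: list[dict]) -> float:
    """Fraction of examples where the model produced the expected tool call."""
    passed = sum(tool_call_matches(generate(ex["prompt"]), ex["expected"]) for ex in examples)
    return passed / len(examples)


# Two invented examples; a real run would use 30 to 50 from the actual workload.
EXAMPLES = [
    {"prompt": "What is the weather in Oslo?",
     "expected": {"name": "get_weather", "arguments": {"city": "Oslo"}}},
    {"prompt": "Convert 10 USD to EUR.",
     "expected": {"name": "convert_currency",
                  "arguments": {"amount": 10, "from": "USD", "to": "EUR"}}},
]


def fake_quantized_model(prompt: str) -> str:
    """Hypothetical stub: gets the first call right and answers the second in prose."""
    if "weather" in prompt:
        return json.dumps({"name": "get_weather", "arguments": {"city": "Oslo"}})
    return "Sure! I will convert that for you."


print(f"tool-call pass rate: {pass_rate(fake_quantized_model, EXAMPLES):.0%}")
```

Run the same examples against the FP16 baseline and against each shortlisted quant, then compare pass rates. Exact-match on arguments is deliberately strict; loosen it only if the downstream code tolerates the variation.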

~75%: the memory cut from 4-bit quantization (GGUF Q4_K_M, AWQ, or GPTQ) versus FP16, with average accuracy typically within 1 to 2 points

What I’d tell you over coffee

The format wars are mostly noise. GGUF, AWQ, GPTQ, and FP8 are not four contestants for one trophy. They are four tools for different boxes, and on average quality they are close enough that arguing about which is best in the abstract is a way to feel productive without deciding anything.

So I would skip the argument. Let the hardware narrow it to one or two formats, prefer the model’s own native low-bit checkpoint when it exists, and then settle the last bit of doubt with a 30-to-50-example test on the one real task at the precision actually headed for production. The teams that get burned are the ones who picked a format off a leaderboard and found out in production that the leaderboard never measured the thing they sell. The teams that stay calm are the ones who spent an afternoon on their own eval and already knew the answer before they shipped. That afternoon is the cheapest insurance in the whole self-hosting stack, and almost nobody buys it until the second time.

Sources

Quantization Tradeoffs: 4-bit vs 8-bit vs FP8 Performance Data - Digital Applied, 2026-04-24
GGUF vs GPTQ vs AWQ 2026: Which Quantization Should You Use? - Local AI Master, 2026-03-18
LLM Quantization Explained: GGUF vs AWQ vs GPTQ - The Complete 2026 Guide - Fungies.io, 2026-05-01
