How to Run an LLM Locally When the Best Model Won't Fit

A workstation GPU rack beside a desk with an oversized model file labeled 340GB that will not fit into the available memory slots, illustrating the gap between free open-weight downloads and the hardware needed to serve them.

A new 1-trillion-parameter open-weight model landed this week that almost no team can actually run. Here is how to treat running an LLM locally as a sizing decision, so the model you pick fits the hardware and the budget you already have.

TLDR

A new open-weight model dropped this week with real benchmark gains and a footprint almost no team can load. Running a model in-house is a sizing decision before it is anything else: pick the model that fits the hardware on hand, quantize to fit comfortably, then prove it still works on the one task that matters.

Free weights, expensive memory

On June 12 a new open-weight coding model landed and the feed lit up. Moonshot AI shipped Kimi K2.7 Code, a 1-trillion-parameter mixture-of-experts model with 32B active parameters, a 256K context window, and a Modified MIT license. The coding gains were not hype. MarkTechPost reported a 21.8% relative jump on the model’s own coding benchmark, alongside roughly 30% lower reasoning-token usage than the previous version.

"The largest coding jump is Kimi Code Bench v2, from 50.9 to 62.0."

MarkTechPost, June 2026

Then comes the part the launch threads skip. The smallest practical local build of that model needs roughly 340GB on disk in a 2-bit format, around 610GB at its native INT4 weights, and close to 2TB at full precision, plus about 350GB of combined RAM and VRAM to reach usable speed. Modem Guides summed it up in eight words: “the weights are free. the memory is not.” That sentence is the whole story of running a model in-house, and it is why so many “we’ll self-host it” plans stall a week after the download finishes.
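
The footprints above can be sanity-checked with back-of-the-envelope arithmetic: parameters times bits per weight, divided by eight. The sketch below is a rough estimate, not how any particular build is packaged. The function name is made up for illustration, and the naive figure comes out below the reported one at low bit widths, presumably because quantized builds keep some tensors at higher precision and store scaling metadata (that explanation is an assumption, not something the sources state).

```python
def weight_footprint_gb(params: float, bits_per_weight: float) -> float:
    """Naive weight size in decimal GB: parameters x bits / 8 bytes."""
    return params * bits_per_weight / 8 / 1e9


PARAMS = 1e12  # 1 trillion total parameters, per the release

# (nominal bits per weight, footprint in GB as reported in the article)
REPORTED = {"2-bit": (2, 340), "INT4": (4, 610), "FP16": (16, 2000)}

for name, (bits, reported_gb) in REPORTED.items():
    naive_gb = weight_footprint_gb(PARAMS, bits)
    print(f"{name:>6}: naive {naive_gb:6.0f} GB, reported ~{reported_gb} GB")
```

Even the naive numbers put the smallest build far beyond a single workstation GPU, which is the point: the order of magnitude is knowable before the download starts.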

340GB: disk just to hold this week's 1-trillion-parameter open-weight model at 2-bit, before it serves a single token

How to run an LLM locally, step by step

The fix is not a bigger model. It is a sequence that starts from the silicon on hand and ends at a model that fits with room to breathe. Here is the order that has held up across the teams I have watched do this well.

  1. Start from the VRAM, not the leaderboard

    Write down total VRAM and system RAM before reading a single benchmark. The Kimi release is the cautionary tale: the best open-weight model of the week was also the one most shops physically cannot load. Capability is the second question. Fit is the first.

  2. Pick a model that fits, then quantize to fit comfortably

    A 7B model at 4-bit GGUF lands around 4 to 5GB of VRAM; a 70B at 4-bit fits on a single 80GB card. The well-worn rule of thumb is that a Q4_K_M build holds roughly 95% of full-precision quality at about a quarter of the memory. Quantization is what turns an impossible model into a runnable one.

  3. Do the VRAM math before any purchase order

    Budget weights plus KV cache plus overhead, never weights alone. Context length drives the KV cache, and long context is exactly where the surprise memory bills hide. A 256K window is a feature on the model card and a liability in the VRAM tally.

  4. Match the serving engine to the traffic, not the benchmark

    A single box running a llama.cpp or Ollama setup for an internal tool is a different decision from high-concurrency production traffic, where a throughput engine like vLLM or SGLang earns its keep. Both ends of that spectrum are well served now. The mistake is picking the engine before knowing the traffic shape.

  5. Build a 50-example eval on the one task that matters

    Averages are where models hide their failures. The quant that looks perfect on a public benchmark is the one that quietly breaks tool-calling on the workload nobody tested. A small, task-specific test set catches that before users do, and it takes an afternoon to build.

  6. Decide the local-and-cloud split honestly

    Most products do not need everything in-house. Put the steady, high-volume, privacy-sensitive traffic local, and let the spiky or rare work stay on an API. A hybrid split is not a compromise; it is usually the cheapest correct answer.
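
The VRAM math from step 3 can be sketched in a few lines. This is a minimal estimate assuming a standard transformer with grouped key/value heads and a 16-bit KV cache; architectures that compress the cache will come in lower. The `ModelShape` values, the 4.5 bits per weight, and the 10% overhead factor are all illustrative assumptions, not figures from any model card.

```python
from dataclasses import dataclass


@dataclass
class ModelShape:
    """Hypothetical architecture numbers; read real ones from the model's config."""
    params: float     # total parameter count
    n_layers: int     # transformer layers
    n_kv_heads: int   # key/value heads
    head_dim: int     # dimension per attention head


def weights_gb(shape: ModelShape, bits_per_weight: float) -> float:
    return shape.params * bits_per_weight / 8 / 1e9


def kv_cache_gb(shape: ModelShape, context_tokens: int,
                batch: int = 1, bytes_per_elem: int = 2) -> float:
    # Leading 2: one key tensor and one value tensor per layer.
    elems = (2 * shape.n_layers * shape.n_kv_heads * shape.head_dim
             * context_tokens * batch)
    return elems * bytes_per_elem / 1e9


def vram_budget_gb(shape: ModelShape, bits_per_weight: float,
                   context_tokens: int, batch: int = 1,
                   overhead_frac: float = 0.10) -> float:
    """Weights plus KV cache, padded by an assumed overhead fraction."""
    subtotal = (weights_gb(shape, bits_per_weight)
                + kv_cache_gb(shape, context_tokens, batch))
    return subtotal * (1 + overhead_frac)


# Made-up 7B-class shape, for illustration only.
SHAPE_7B = ModelShape(params=7e9, n_layers=32, n_kv_heads=8, head_dim=128)

for ctx in (8_192, 262_144):
    kv = kv_cache_gb(SHAPE_7B, ctx)
    total = vram_budget_gb(SHAPE_7B, bits_per_weight=4.5, context_tokens=ctx)
    print(f"context {ctx:>7}: KV cache {kv:5.1f} GB, budget {total:5.1f} GB")
```

With these made-up numbers the weights come to about 4GB, while a single 256K-token sequence needs roughly 34GB of KV cache. The context window, not the parameter count, sets the budget.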

Key Insight

Running a model locally is a sizing discipline, not a download. The first real decision is which model fits the hardware on hand, not which model tops the leaderboard.


Why “run AI models locally” trips up smart teams

The phrase “run AI models locally” makes it sound like one move: grab the weights, point an app at them, done. Smart teams get burned in two predictable spots.

The first is buying the model before the math. A trillion-parameter release is thrilling right up until the VRAM tally lands, and by then the GPU is often already ordered. The Kimi footprint is an extreme example, but the pattern is everyday: the model gets chosen on capability and sized on hope.

The second is trusting a quant on someone else’s benchmark. A 4-bit build can score within a point or two of full precision on a public leaderboard and still degrade on the specific thing a product relies on, because the capability that breaks first is rarely the capability the benchmark measures. That is why the eval step is not optional.
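
A task-specific eval does not need a framework. The sketch below compares a baseline and a quantized build on 50 tool-calling prompts and lists the prompts where only the quantized build fails. Both models are stand-in stubs invented for this example (the "quantized" one is rigged to emit broken JSON on longer prompts); in practice each would be a function that calls the real serving endpoint, and the checker would encode whatever the product actually depends on.

```python
import json
from typing import Callable, List, Tuple

# One example = (prompt, checker that returns True when the output is acceptable)
Example = Tuple[str, Callable[[str], bool]]
Model = Callable[[str], str]


def is_tool_call(output: str) -> bool:
    """Pass only if the output parses as JSON and names a tool."""
    try:
        obj = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(obj, dict) and "tool" in obj


def pass_rate(model: Model, examples: List[Example]) -> float:
    passed = sum(1 for prompt, check in examples if check(model(prompt)))
    return passed / len(examples)


def regressions(baseline: Model, candidate: Model,
                examples: List[Example]) -> List[str]:
    """Prompts the baseline passes but the candidate fails."""
    return [p for p, check in examples
            if check(baseline(p)) and not check(candidate(p))]


# Hypothetical stand-ins for the two builds under test.
def baseline_stub(prompt: str) -> str:
    return json.dumps({"tool": "search", "args": {"q": prompt}})


def quant_stub(prompt: str) -> str:
    out = baseline_stub(prompt)
    # Simulated failure mode: truncated JSON on longer prompts.
    return out[:-1] if len(prompt) > 55 else out


EXAMPLES: List[Example] = [
    ("look up item " + "x" * i, is_tool_call) for i in range(50)
]

print("baseline :", pass_rate(baseline_stub, EXAMPLES))
print("quantized:", pass_rate(quant_stub, EXAMPLES))
print("regressed:", len(regressions(baseline_stub, quant_stub, EXAMPLES)))
```

The list of regressed prompts is more useful than the pass rate itself, because it shows which capability broke rather than just how often.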

And there is the quiet one nobody budgets for: who reboots the box. The average “we’ll just self-host it” plan has a slide for the model and zero slides for who carries the pager. That gap is usually where the projected savings go to die.

The best open-weight model of the week is often the one almost nobody can actually run.


The numbers that decide local versus the API

Here is where the decision gets unsentimental. A June 13 pricing snapshot from BenchLM put a self-hosted Llama 3.1 405B at roughly $18,221 a month to run continuously, while smaller open-weight options, such as various Qwen sizes, were estimated at between about $429 and $2,610 a month. Against that, an API such as DeepSeek V3 was listed near $1.10 per million output tokens.

The math that follows is boring and decisive. Self-hosting beats the API on unit cost only above a real volume line, and most teams cannot see their line. Below it, idle GPUs quietly erase the savings. The honest break-even for a mid-size model usually sits in the tens of millions of tokens a day, which is far more traffic than most products generate.
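
The volume line can be estimated from the figures above. The sketch below is deliberately crude: it prices output tokens only, ignores input-token charges, staffing, and power, and assumes a 30-day month, so treat the result as a floor on the traffic needed rather than a forecast. The function name is invented for this example.

```python
def break_even_tokens_per_day(monthly_hosting_usd: float,
                              api_usd_per_million_tokens: float,
                              days_per_month: int = 30) -> float:
    """Daily token volume at which the API bill equals the hosting bill."""
    millions_per_month = monthly_hosting_usd / api_usd_per_million_tokens
    return millions_per_month * 1e6 / days_per_month


API_PRICE = 1.10  # USD per million output tokens, from the BenchLM snapshot

# Monthly self-hosting estimates from the same snapshot.
SCENARIOS = [
    ("Qwen, low estimate", 429),
    ("Qwen, high estimate", 2_610),
    ("Llama 3.1 405B", 18_221),
]

for label, monthly_cost in SCENARIOS:
    per_day = break_even_tokens_per_day(monthly_cost, API_PRICE)
    print(f"{label:<20} ${monthly_cost:>6}/mo -> {per_day / 1e6:6.1f}M tokens/day")
```

Under these assumptions the Qwen estimates break even at roughly 13 to 79 million output tokens a day, and the 405B model at over half a billion.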

What it takes to hold Kimi K2.7 Code locally
Precision      Approx. footprint
2-bit GGUF     ~340GB
Native INT4    ~610GB
FP16 (full)    ~2TB

Start with your VRAM, not the model

If the team is starting this week, do not start with the model. Start with a plain list of the VRAM and RAM actually on hand, then shop for a model and a quant that fit inside it with headroom. Run a 50-example eval on the real task before anyone trusts the smaller weights. And name the person who reboots the box out loud, in the plan, before the box exists.

The thing I keep coming back to: open weights got dramatically better this month, and that is genuinely good news. It just did not make the hard part disappear. The hard part was never getting the model. It was fitting it, serving it, and proving it still works on the one job that pays the bills. Get that sequence right and running a model in-house stops feeling like a gamble and starts feeling like plumbing. Reliable, unglamorous, and entirely figure-out-able.

Sources

  1. Moonshot AI Releases Kimi K2.7-Code: a Coding Model Reporting +21.8% on Kimi Code Bench v2 Over K2.6 - MarkTechPost, 2026-06-12
  2. Kimi K2.7-Code: Open Weights, 340GB Reality Check - Modem Guides, 2026-06-12
  3. LLM API Pricing Comparison 2026 (self-host cost estimates) - BenchLM.ai, 2026-06-13
