---
title: The Best Local LLM for Coding Is the One That Fits Your GPU, Not the Leaderboard
slug: best-local-llm-for-coding-fit-not-leaderboard
date: 2026-06-23
excerpt: "The open-weight coding model topping SWE-bench is usually the one almost nobody can run on a single box. Here is how to pick a local coding model by fit to your task and your GPU, with the named models that actually run on one card."
featured_image: "https://bbtxujdxvidaghmhxkqs.supabase.co/storage/v1/object/public/generated-images/blog-1782205130209-best-local-llm-for-coding-fit-not-leaderboard.webp"
featured_image_alt: A single consumer GPU on a desk beside a sticky note reading 24GB, with a laptop showing a coding model benchmark chart, illustrating the gap between leaderboard size and what fits on one card.
canonical_url: https://cerevisor.com/blog/best-local-llm-for-coding-fit-not-leaderboard
updated_at: 2026-06-23T08:58:51.106063+00:00
---

# The Best Local LLM for Coding Is the One That Fits Your GPU, Not the Leaderboard

**TL;DR**

The open-weight coding model at the top of SWE-bench is almost always the one a single box cannot run. The models that actually fit one 24GB card, like Devstral Small 2 and Qwen 3.6 27B, sit a few points lower on the chart and are usually good enough for the real workload. Picking by fit to the task and the GPU on hand, not by the leaderboard, is how a team ships a coding model this quarter instead of admiring one.

I keep having the same conversation with infra leads. It starts with “we want to self-host a coding model” and within ten minutes it has drifted into which [open-weight model](/blog/open-weight-model-selection-fit-not-leaderboard) just topped the SWE-bench chart. The trouble is that the model topping the chart and the model a team can actually run on the hardware it already owns are almost never the same thing, and the gap between them is where a quarter goes to die.

Here is the number that makes it concrete. The current open-weight coding leader, GLM-5.2, is a 753-billion-parameter mixture-of-experts model. A guide from Pinggy, updated June 18, clocks it at 79.65 on the LiveBench Coding Average, the highest open score they list. It also needs a multi-GPU rig to run. That is the whole tension in one model: the best score on the board belongs to a thing a single workstation cannot load.

So let me walk through how I actually pick a local coding model, and the named models that go with each decision. None of the numbers below are fresh this week. They come from benchmark writeups dated early-to-mid June and a couple from earlier in the spring, and I will date them as I go. The reason I am comfortable writing this anyway is that the durable part, the method for choosing, does not change when a new model drops. The leaderboard reshuffles. The method holds.

---

## Start with the card on hand, not the model on the wishlist

The first move is boring and it is the one teams skip: write down how much VRAM the box actually has, then pick from the models that fit. Everything else is downstream of that one number.

The rough math is worth memorizing because it sizes any model in a few seconds of mental arithmetic. At FP16 a model needs about 2 bytes per parameter, so a 70B model wants roughly 140GB. At INT4 it drops to about half a byte per parameter, so that same 70B lands near 35GB. Then add 10 to 20 percent on top for the KV cache, activations, and framework overhead. Pinggy and InsiderLLM (updated May 26) both land on the same tiers from there: 8GB of VRAM covers 7B-to-8B models, 24GB is the practical floor for a 30B-class model, and 40GB-plus is the territory for 70B unless the team quantizes hard.

That single calculation kills most of the bad decisions before they happen. It says, before anything has been downloaded, whether the exciting model is a download or a hardware purchase order.
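The rule of thumb above is easy to turn into a few lines of code. This is a minimal sketch, not a precise sizing tool: the function names and the default overhead values are my own choices for illustration, and the only inputs are the bytes-per-parameter figures and the 10 to 20 percent overhead range already stated.

```python
# Bytes per parameter for the two precisions discussed above.
BYTES_PER_PARAM = {"fp16": 2.0, "int4": 0.5}


def estimate_vram_gb(params_billion: float, precision: str, overhead: float = 0.15) -> float:
    """Rough VRAM need in GB: weight size plus a fractional overhead.

    `overhead` stands in for KV cache, activations, and framework cost;
    0.10 to 0.20 is the range quoted in the text. 0.15 is an arbitrary midpoint.
    """
    weights_gb = params_billion * BYTES_PER_PARAM[precision]
    return weights_gb * (1.0 + overhead)


def fits_on_card(params_billion: float, precision: str, vram_gb: float) -> bool:
    """Conservative check: assume the high end (20 percent) of the overhead range."""
    return estimate_vram_gb(params_billion, precision, overhead=0.20) <= vram_gb


if __name__ == "__main__":
    for size in (8, 27, 70):
        low = estimate_vram_gb(size, "int4", overhead=0.10)
        high = estimate_vram_gb(size, "int4", overhead=0.20)
        print(f"{size}B at INT4: about {low:.1f} to {high:.1f} GB, "
              f"fits 24GB card: {fits_on_card(size, 'int4', 24)}")
```

Treat the output as a floor rather than a promise. Long contexts grow the KV cache well past a flat percentage, which is one reason reported footprints for a given model span a range instead of a single number.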

**~35GB:** VRAM for a 70B model at INT4, versus ~140GB at FP16, before KV cache and overhead (Pinggy, June 2026; InsiderLLM, May 2026)

## The 5 open-weight coding models worth shortlisting, by the card they fit

I am going to rank these the way I actually think about them, which is by the hardware they run on, not by raw score. Each one is grounded in a specific benchmark from the June writeups, with the date attached.

- **Devstral Small 2 (24B) for agentic coding on a single card.** This is my default recommendation for a team that wants multi-file edits and debugging loops on one GPU. Pinggy (June 18) puts it at 68% on SWE-bench Verified, running on a single RTX 4090 or a 32GB Mac. It was built with All Hands AI specifically to drive coding agents, and Apache 2.0 means the license will not surprise a legal team. One good GPU plus a desire for an agent: start here.

- **Qwen 3.6 27B for the strongest dense model on 24GB.** The dense 27B is the best "fits one 24GB card" coder I have seen numbers for. InsiderLLM (May 26) and Pinggy (June 18) both report 77.2 on SWE-bench Verified at roughly 17 to 22GB depending on quant and context. InsiderLLM's own line is that this "puts the 27B in the same range as Claude Sonnet." Every parameter activates per token, so reasoning stays consistent. This is the one I reach for when the task is real code, not autocomplete.

- **Qwen3.6-35B-A3B for MoE latency on the same card.** For a team that likes the idea of a larger model but cannot afford the latency of a dense one, the 35B mixture-of-experts variant activates only 3B parameters per token. A DEV Community writeup (June 8) reports 73.4 on SWE-bench Verified and notes it "fits on a single RTX 4090 or M-series Mac with 32GB." The payoff is a big-model knowledge base with small-model speed, which is the entire reason MoE exists.

- **Codestral 22B when the job is autocomplete, not agents.** This is the one teams forget, because it does not win the headline benchmark. The DEV writeup (June 8) puts Codestral 22B at 95.3% fill-in-the-middle pass@1 at about 14GB in Q4_K_M. Fill-in-the-middle is what powers inline autocomplete, and it is a different job from agentic editing. When developers mostly want fast, accurate tab-completion in the editor, the agent-tuned models are the wrong tool and this is the right one.

- **Qwen3-Coder-Next (80B / 3B active) as the worked example of model-to-hardware sizing.** This one is older, from a February DEV guide, but it is the cleanest worked example of fitting a model to a specific box. The 80B MoE activates 3B per token at 256K context under Apache 2.0, scores 44.3 on SWE-bench Pro, and the guide pairs it with an RTX 5090 32GB plus 128GB of DDR5 to hit 30 to 40 tokens per second at full context. Notice the pattern: the answer is not just a GPU, it is a GPU-and-system-RAM split. MoE models lean on that split, and budgeting for it is half the sizing job.
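The GPU-and-system-RAM split in that last item can be budgeted with the same half-a-byte arithmetic. What follows is a rough sketch under assumptions I am adding, not anything the February guide specifies: I assume INT4 weights (the guide's quantization is not quoted here), a flat 15 percent overhead, and that whatever does not fit in VRAM spills to system RAM. The function name is hypothetical, and real runtimes decide layer and expert placement in their own way.

```python
def split_gpu_and_system_ram(
    total_params_billion: float,
    bytes_per_param: float,
    gpu_vram_gb: float,
    overhead: float = 0.15,
) -> tuple[float, float]:
    """Return (GB held in VRAM, GB spilled to system RAM) for a model.

    Only totals are budgeted here. Which layers or experts land where is up
    to the serving runtime and is not modeled.
    """
    needed_gb = total_params_billion * bytes_per_param * (1.0 + overhead)
    on_gpu_gb = min(needed_gb, gpu_vram_gb)
    in_system_ram_gb = max(0.0, needed_gb - gpu_vram_gb)
    return on_gpu_gb, in_system_ram_gb


if __name__ == "__main__":
    # 80B total parameters, assumed INT4 (0.5 bytes per parameter), 32GB card.
    on_gpu, in_ram = split_gpu_and_system_ram(80, 0.5, 32)
    print(f"VRAM: {on_gpu:.1f} GB, system RAM: {in_ram:.1f} GB")
```

Under those assumptions an 80B model does not fit in 32GB even at INT4, so a chunk of it has to live in system RAM. That is consistent with why the guide's recommended box carries so much DDR5, though the exact numbers depend on the quant and context actually used.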

Coding score versus what it takes to run (background figures, dated)

| Model | Coding score | Runs on |
| --- | --- | --- |
| GLM-5.2 (753B/40B) | **79.65 LiveBench Coding (Jun 18)** | Multi-GPU |
| Qwen 3.6 27B (dense) | 77.2 SWE-bench Verified (Jun 18) | Single 24GB card |
| Qwen3.6-35B-A3B (MoE) | 73.4 SWE-bench Verified (Jun 8) | Single RTX 4090 / 32GB Mac |
| Devstral Small 2 (24B) | 68% SWE-bench Verified (Jun 18) | Single RTX 4090 / 32GB Mac |

## Why most teams reach for the model they cannot run

The mistake is not stupidity. It is that the leaderboard is the most visible artifact in the whole space, so it becomes the default proxy for “good.” And the top of that board is genuinely impressive. The same Pinggy snapshot that crowns GLM-5.2 also lists Kimi K2.6, a 1-trillion-parameter MoE at 78.57 LiveBench Coding, needing two H100 80GB cards or four A100s and 512GB of RAM, and DeepSeek V3.2 at 75.69, also multi-GPU. These are real, excellent models. They are also operational commitments most teams have not budgeted for.

Here is the part that does not fit on a leaderboard: the gap of under three points between the 79.65 model nobody can run and the 77.2 model that fits one card (two scores from different benchmarks, so the comparison is loose to begin with) is almost never the thing that decides whether a coding workflow works. What decides it is whether the model is loaded, fast, and available when a developer hits tab. A model that scores three points higher and runs on hardware the team does not own scores zero in production.


And “best” splits by task in a way a single number hides. Agentic multi-file editing and inline fill-in-the-middle autocomplete are different jobs, and the model that wins one can lose the other. The 95.3% FIM number for Codestral is meaningless for an agent workload, and the agentic scores are meaningless for developers who just want fast autocomplete. The leaderboard collapses both into one ranking. The real workload does not.

For honesty’s sake, here is a headline score of this kind in its original words, describing GLM-5.1.

> "58.4 on SWE-bench Pro, outperforming GPT-5.4 and Claude Opus 4.6."
>
> DEV Community, June 2026

That is a real, specific, attributable number, and it is exactly the kind of line that pulls a team toward a model it has no way to serve. The number is true. The conclusion most people draw from it is the trap.

**Key Insight**

"Which model is best" is the wrong question for a self-hosted deployment. The better question is "which model clears the task's quality bar while fitting the GPU already on hand," and that one usually has a different, smaller, more boring answer.

## The numbers that prove it is good enough

Once a shortlist fits the card, the move is to stop reading benchmarks and start measuring the actual task. Public coding benchmarks show that a model can code in general. They say nothing about whether it can code this codebase, with these conventions, against this test suite.

Build a standing set of 30 to 50 real examples from the team’s own repository: tickets the model should be able to close, functions it should complete, bugs it should find. Run the shortlisted models against that set at the quant actually destined for production, not the full-precision weights the benchmark used. Measure pass rate, sure, but also measure the things that quietly break first, like whether tool-calls come back well-formed and whether multi-file edits stay coherent. The gap between a model’s headline SWE-bench score and its score on those 50 examples is the only gap that pays the bills.
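Scoring that standing set does not need much machinery. The sketch below is one way to do it, assuming the team has already run each example and recorded whether it passed plus the raw tool-call strings the model emitted. The `EvalResult` shape and the tool-call schema (a JSON object with a string `name` and an object `arguments`) are hypothetical stand-ins, so swap in whatever the agent framework in use actually expects.

```python
import json
from dataclasses import dataclass, field


@dataclass
class EvalResult:
    example_id: str
    passed: bool  # did the team's own check (tests, review) accept the output?
    raw_tool_calls: list[str] = field(default_factory=list)


def is_well_formed(raw_call: str) -> bool:
    """True if the call parses as JSON and matches the assumed schema."""
    try:
        call = json.loads(raw_call)
    except json.JSONDecodeError:
        return False
    return (
        isinstance(call, dict)
        and isinstance(call.get("name"), str)
        and isinstance(call.get("arguments"), dict)
    )


def summarize(results: list[EvalResult]) -> dict[str, float]:
    """Pass rate over examples, and well-formed rate over all tool calls."""
    calls = [c for r in results for c in r.raw_tool_calls]
    pass_rate = sum(r.passed for r in results) / len(results) if results else 0.0
    well_formed = sum(is_well_formed(c) for c in calls) / len(calls) if calls else 1.0
    return {"pass_rate": pass_rate, "tool_call_well_formed_rate": well_formed}


if __name__ == "__main__":
    demo = [
        EvalResult("ticket-1", True, ['{"name": "edit_file", "arguments": {"path": "a.py"}}']),
        EvalResult("ticket-2", False, ['{"name": "edit_file", "arguments": '])  # truncated JSON
    ]
    print(summarize(demo))
```

Running the same summary once per shortlisted model, at the production quant, gives a side-by-side that means more for this codebase than any public score.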

Then watch utilization and tokens per second under real traffic. A model that clears the quality bar but serves at single-digit utilization is a model the team is overpaying to run. The point of self-hosting was the unit economics, and the unit economics only show up when the card is busy.
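For a first look at those two numbers, request logs are often enough. This is a sketch under a loud assumption: it uses "share of wall-clock time with at least one request in flight" as a stand-in for utilization, which is a proxy and not the GPU's own utilization counter. The `RequestLog` fields and the function name are hypothetical, so map them onto whatever the serving layer actually records.

```python
from dataclasses import dataclass


@dataclass
class RequestLog:
    start_s: float      # request start, seconds
    end_s: float        # request end, seconds
    output_tokens: int  # tokens generated for this request


def throughput_and_busy_fraction(
    logs: list[RequestLog], window_start_s: float, window_end_s: float
) -> tuple[float, float]:
    """Return (aggregate output tokens per second, busy fraction) for a window.

    Simplification: a request's tokens count in full even if it only partly
    overlaps the window.
    """
    window = window_end_s - window_start_s
    if window <= 0:
        raise ValueError("window must have positive length")

    total_tokens = sum(r.output_tokens for r in logs)

    # Merge overlapping request intervals so concurrent requests are not double counted.
    busy = 0.0
    cur_start = cur_end = None
    for r in sorted(logs, key=lambda r: r.start_s):
        s, e = max(r.start_s, window_start_s), min(r.end_s, window_end_s)
        if e <= s:
            continue  # request lies entirely outside the window
        if cur_end is None or s > cur_end:
            if cur_end is not None:
                busy += cur_end - cur_start
            cur_start, cur_end = s, e
        else:
            cur_end = max(cur_end, e)
    if cur_end is not None:
        busy += cur_end - cur_start

    return total_tokens / window, busy / window
```

A busy fraction stuck in the single digits over a normal working day is the signal from the paragraph above: the model may be good enough, but the card is mostly idle.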

## What I would do Monday morning

Write the VRAM number on a sticky note. Cross off every model that does not fit it, including the one at the top of the chart, especially the one at the top of the chart. From what is left, pick by task: Devstral Small 2 or Qwen 3.6 27B for a coding agent on one card, Codestral 22B for autocomplete-heavy editing, the 35B MoE for a bigger brain without the latency tax. Then run those 50 examples against the quant headed for production before anyone calls it done.

The leaderboard will have a new champion by next month, and it will probably be one that still cannot run on the card on the desk. That is fine. The model that ships features this quarter is the one that fits, clears the task bar, and stays loaded. Nothing humbles a roadmap faster than a 753B model and a single 24GB card, and nothing speeds it up like admitting that the 27B was always enough.

#### Sources

- [Best Open Source Self-Hosted LLMs for Coding in 2026](https://pinggy.io/blog/best_open_source_self_hosted_llms_for_coding/) - Pinggy, 2026-06-18

- [The Best Open Source LLMs for Coding Right Now (June 2026)](https://dev.to/zyvop/the-best-open-source-llms-for-coding-right-now-june-2026-n10) - DEV Community, 2026-06-08

- [Best Local Coding Models Ranked: Every VRAM Tier, Every Benchmark (2026)](https://insiderllm.com/guides/best-local-coding-models-2026/) - InsiderLLM, 2026-05-26

- [Qwen3-Coder-Next: The Complete 2026 Guide to Running Powerful AI Coding Agents Locally](https://dev.to/sienna/qwen3-coder-next-the-complete-2026-guide-to-running-powerful-ai-coding-agents-locally-1k95) - DEV Community, 2026-02-04