---
title: "Self-hosting an open-weight LLM this quarter: who should own the memory?"
slug: local-self-host-llm-cost-memory-spike
date: 2026-06-22
excerpt: "GPU rental prices are flat year over year while the memory you would buy to self-host is spiking hard. The build-vs-buy question this quarter is really about who absorbs the repriced asset, and a volume line decides it."
featured_image: "https://bbtxujdxvidaghmhxkqs.supabase.co/storage/v1/object/public/generated-images/blog-1782112673117-local-self-host-llm-cost-memory-spike.webp"
featured_image_alt: "A split scene: on the left, a rack of rented data-center GPUs glowing steadily; on the right, a single boxed memory module on a finance desk with a rising price chart behind it, representing the rent-versus-buy decision for self-hosted AI."
canonical_url: https://cerevisor.com/blog/local-self-host-llm-cost-memory-spike
updated_at: 2026-06-22T07:17:54.388044+00:00
---

# Self-hosting an open-weight LLM this quarter: who should own the memory?

I pulled up the live GPU rental tracker this morning, the same one a lot of infra leads keep bookmarked, and the number that stopped me was the one that did not move. As of June 22, GetDeploying has the H100 renting at an average of $3.11 an hour across 45 providers, with the cheapest spot offer at $0.47. The on-demand rate is the telling figure: $3.48 a year ago, $3.60 today. That is a rise of about 3% over [twelve months](/blog/four-questions-before-lock-in), in the middle of the loudest memory-price panic the industry has seen in years.

TLDR

GPU rental prices are essentially flat year over year (H100 and H200 up about 3%, B200 up about 7%) even as the memory you would buy to self-host is spiking 60% and more. So the build-vs-buy decision this quarter is not "local versus cloud." It is "who should own the suddenly expensive, depreciating memory," and a specific token-volume line, not the memory headline, decides the answer.

## The price that barely moved while the headlines screamed shortage

Here is the gap that matters for an executive team right now. The renting side of self-hosting is calm. The buying side is on fire.

GetDeploying’s June 22 snapshot has H200 at a $4.10 average, up about 3% year over year, and B200 at a $4.88 average, up about 7%, with its lowest rate at $2.25 an hour on a 36-month reservation. Those are not panic numbers. They are the numbers of a competitive rental market where providers already bought their capacity and now fight each other on the hourly rate.

Now look at the other side of the ledger. Network World reported back in January that Samsung’s 32GB DDR5 modules jumped to $239 from $149, a 60% move, and that DDR5 contract prices had more than doubled to $19.50 a unit from around $7. Gartner has DRAM rising sharply across 2026, and most of next year’s high-bandwidth memory is already sold out under multi-year deals. Memory is now the dominant [line item](/blog/which-ai-line-item-survives-q3-budget-review) in a data-center GPU’s bill of materials.

So the asset a team would buy to self-host got dramatically more expensive, and the option to rent that same asset barely changed. That is the whole tension in one sentence.
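For readers who want to check the arithmetic, here is a quick Python sketch that recomputes the year-over-year changes from the figures quoted above. The prices are the ones this post cites; the dictionary keys and the `pct_change` helper are illustrative names, not anything from the sources.

```python
# Price moves quoted in this post: GetDeploying for the H100 on-demand rate,
# Network World for the two DDR5 figures. Each entry is (old, new).
PRICES = {
    "H100 on-demand, $/GPU-hr": (3.48, 3.60),
    "Samsung 32GB DDR5 module, $": (149.0, 239.0),
    "DDR5 contract price, $/unit": (7.0, 19.50),
}


def pct_change(old: float, new: float) -> float:
    """Percentage change from old to new."""
    return (new - old) / old * 100.0


for item, (old, new) in PRICES.items():
    print(f"{item}: {old:.2f} -> {new:.2f} "
          f"({pct_change(old, new):+.1f}%, {new / old:.2f}x)")
```

Run as-is, it reports +3.4% for the H100 on-demand rate, +60.4% for the DDR5 module, and 2.79x for the DDR5 contract price, so "more than doubled" is, if anything, an understatement.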

+3%

year-over-year change in average H100 cloud rental price, June 2025 to June 2026, while 32GB DDR5 memory rose about 60% over a similar span

---

## The spike changes who eats the hardware cost

The instinct, when [memory prices spike](/blog/build-vs-buy-gpus-memory-spike), is to read it as a signal about local models in general. It is not. The memory spike does not make self-hosting smarter or dumber. It changes who absorbs the cost of the hardware, and that is a different question.

Renting hands the repriced, suddenly scarce, depreciating memory to a provider who already paid the old price and who carries the risk that next year’s card makes this one worth less. Buying means writing the check for a part bought at the peak of its market, and holding it on the balance sheet while it ages. In a flat-rent, spiking-buy year, that asymmetry is sharper than usual.

None of this repeals the volume math, the part most “we will self-host to save money” decks quietly skip. The honest break-even still governs everything. Below roughly half a million tokens a day against a hosted open-model API, pay-per-token is simply cheaper, because a rented or owned GPU costs the same whether it serves zero tokens or four billion. Against the optimized open-model API providers, who already run inference at scale on thin margins, that line climbs far higher, into tens of millions of tokens a day.

And the GPU invoice is never the full bill. A self-hosted deployment is an operations job: published cost guides converge on a 3-to-5x multiplier over the raw GPU cost once a DevOps salary and idle time are counted, often $750 to $3,000 a month in labor alone before any hardware.
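The break-even logic above reduces to one division: a flat daily self-hosting bill divided by the API's per-token price. The sketch below is a minimal model under loudly stated assumptions: one H100 rented around the clock at the $3.60 on-demand rate quoted earlier, $750 a month of labor (the low end of the range above), and a hypothetical API price of $2.00 per million tokens. `SelfHostCost`, `break_even_tokens_per_day`, and `cheaper_option` are hypothetical names, and the model ignores the hardware's throughput ceiling, so it only says where the two cost lines cross, not whether one GPU could actually serve that volume.

```python
from dataclasses import dataclass


@dataclass
class SelfHostCost:
    """Hypothetical model of a flat daily self-hosting bill."""
    gpu_hourly_usd: float     # rental rate per GPU-hour
    gpu_count: int            # GPUs kept running around the clock
    labor_monthly_usd: float  # DevOps time attributed to this deployment

    def daily_usd(self) -> float:
        # A rented GPU bills the same whether it is busy or idle,
        # so this figure does not depend on token volume.
        return self.gpu_hourly_usd * self.gpu_count * 24 + self.labor_monthly_usd / 30


def break_even_tokens_per_day(cost: SelfHostCost, api_usd_per_million: float) -> float:
    """Daily token volume at which pay-per-token spend equals the flat self-host bill."""
    return cost.daily_usd() / api_usd_per_million * 1_000_000


def cheaper_option(tokens_per_day: float, cost: SelfHostCost, api_usd_per_million: float) -> str:
    """Compare one day's API spend with one day's self-host bill."""
    api_daily_usd = tokens_per_day / 1_000_000 * api_usd_per_million
    return "self-host" if cost.daily_usd() < api_daily_usd else "API"


if __name__ == "__main__":
    one_h100 = SelfHostCost(gpu_hourly_usd=3.60, gpu_count=1, labor_monthly_usd=750)
    api_price = 2.00  # hypothetical $ per million tokens, not a quoted rate
    print(f"Self-host bill: ${one_h100.daily_usd():.2f}/day")
    print(f"Break-even: {break_even_tokens_per_day(one_h100, api_price):,.0f} tokens/day")
    for volume in (500_000, 10_000_000, 100_000_000):
        print(f"{volume:>11,} tokens/day -> {cheaper_option(volume, one_h100, api_price)}")
```

With those placeholder inputs the bill is $111.40 a day and the crossover sits near 56 million tokens a day. A cheaper API price or a larger labor share pushes the line higher, which is why the inputs matter more than the formula.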

> "On-demand pricing has increased by about 3% since June 2025, from $3.48 to $3.60/hr per GPU."

GetDeploying, H100 cloud pricing tracker, June 2026

## Whether to buy GPUs into a peak market

**“If memory is spiking, should we buy GPUs now before they get worse?”** Only if the workload is large, steady, and long-lived enough to amortize hardware bought at a market peak, and only if the company is above its break-even volume. Buying into a peak memory market to serve a workload that sits below break-even is the most expensive way to make a margin problem worse. For most teams this quarter, the spike is an argument to rent or stay on the API, not to buy.

**“Everyone says local is cheaper. Why does our infra lead keep saying the API is fine?”** Because both are right, at different volumes. Local wins on unit cost above a usage line and loses below it, and the line is driven by sustained token volume and GPU utilization, not by the model. An infra lead who has watched a self-hosted box idle at single-digit utilization is telling the room something true: the cheapest token is the one nobody had to keep a GPU warm to serve.

**“What does this do to our [gross margin](/blog/series-a-gross-margin-q3-investor-question) story?”** That depends entirely on where the volume sits relative to the line, and whether traffic is steady or bursty. A reference point from a February cost breakdown by DevTk: self-hosted Llama 405B runs about $5.47 per million output tokens on an eight-H100 node at roughly $49 an hour, and a budget hosted API can undercut that until the volume gets seriously high. The margin case for local is real, but it is a volume case, not a vibe.
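DevTk's two numbers also imply a throughput, which is the variable the utilization argument turns on. The sketch below back-solves that throughput from $49 an hour and $5.47 per million tokens, then shows what happens to cost per token when the same node is busy for only part of the time. The function names are hypothetical, the implied throughput is derived here rather than reported by DevTk, and the model treats the $5.47 figure as the fully busy case, with utilization as a simple linear discount on tokens served.

```python
SECONDS_PER_HOUR = 3600


def implied_tokens_per_second(node_usd_per_hour: float, usd_per_million: float) -> float:
    """Throughput the node must sustain to hit the quoted cost per million tokens."""
    tokens_per_hour = node_usd_per_hour / usd_per_million * 1_000_000
    return tokens_per_hour / SECONDS_PER_HOUR


def cost_per_million_tokens(node_usd_per_hour: float, tokens_per_second: float,
                            utilization: float = 1.0) -> float:
    """Cost per million tokens when the node is busy only `utilization` of the time."""
    tokens_per_hour = tokens_per_second * SECONDS_PER_HOUR * utilization
    return node_usd_per_hour / tokens_per_hour * 1_000_000


NODE_USD_PER_HOUR = 49.0  # eight-H100 node, per the DevTk figure above
QUOTED_USD_PER_M = 5.47   # DevTk's self-hosted Llama 405B output cost

tps = implied_tokens_per_second(NODE_USD_PER_HOUR, QUOTED_USD_PER_M)
print(f"Implied throughput: {tps:,.0f} output tokens/s across the node")
for util in (1.0, 0.5, 0.25, 0.05):
    cost = cost_per_million_tokens(NODE_USD_PER_HOUR, tps, util)
    print(f"{util:>4.0%} busy -> ${cost:,.2f} per million tokens")
```

Under that assumption the node has to push roughly 2,500 output tokens a second to reach $5.47, and at 5% busy the same node costs about $109 per million, twenty times the headline figure. If DevTk's number already assumes less than full load, the node's peak is higher, but the inverse relationship between utilization and unit cost holds either way.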

Key Insight

The memory spike is not a reason to buy or to avoid local models. It is a reason to be deliberate about who holds the depreciating asset. Below the break-even volume, renting or staying on the API lets someone else carry the repriced memory. Above it, reserved capacity does the same while the team keeps the savings.

---

## Rental is flat, memory spiked 60 percent

If you have one minute with the board, say this. GPU rental prices are flat year over year while the memory we would buy to self-host has spiked 60% and more. That does not change whether self-hosting saves us money. It changes who eats the cost of the hardware. We should not buy GPUs into a peak memory market unless our volume is high, steady, and proven past break-even. For everything below that line, renting reserved capacity or staying on the API hands the repriced, depreciating asset to a provider and keeps our balance sheet clean. The decision is rent versus own the memory, and our token volume decides it, not the shortage headline.

As DevTk put it in their February breakdown: “For teams processing fewer than 10B tokens per month, APIs are cheaper, simpler, and better maintained.” That number is not fresh, but the logic it captures is exactly what the June rental data reinforces.

## Watch rental rates and your utilization

Two dials, both calm. Watch whether reserved and spot rental rates stay flat as the memory crunch deepens through the back half of the year; if providers start passing acquisition costs through, the rent-versus-buy math shifts and it is worth re-running. And watch utilization, because that is the variable a team actually controls. The companies that get self-hosting right are not the ones who bought the most GPU. They are the ones who knew their break-even volume before they signed anything, and let the cloud keep the workload that never crossed the line.

#### Sources

- [H100 Cloud Pricing: Compare 45+ Providers (2026)](https://getdeploying.com/gpus/nvidia-h100) - GetDeploying, 2026-06-22

- [B200 Cloud Pricing: Compare 26+ Providers (2026)](https://getdeploying.com/gpus/nvidia-b200) - GetDeploying, 2026-06-22

- [Self-Hosting LLM vs API: Real Cost Breakdown 2026](https://devtk.ai/en/blog/self-hosting-llm-vs-api-cost-2026/) - DevTk.AI, 2026-02-25

- [Samsung warns of memory shortages driving industry-wide price surge in 2026](https://www.networkworld.com/article/4113772/samsung-warns-of-memory-shortages-driving-industry-wide-price-surge-in-2026.html) - Network World, 2026-01-07
