M4 · The KV cache, the silent capacity killer

The previous module introduced the KV cache as the conversation's working memory. This module is about why that working memory is the single biggest determinant of how many users your infrastructure can serve at once.

If you take one thing from this module, take this. The KV cache grows in two directions, with context length and with concurrent users, and the two multiply. When you change either one, the memory budget moves by a lot, not a little.

What actually lives in the KV cache

When the model reads a prompt during prefill, it computes intermediate values for every token. The model could throw these away and recompute them as needed, but that would be wildly expensive. The math says: store them, and reuse them.

These stored intermediate values are the KV cache. The name comes from the math of attention, the mechanism that lets each token "see" the others. For every token in a conversation, the model stores two vectors at each attention layer, called the key and the value, in the GPU's high-bandwidth memory (HBM).

Two facts about this storage matter.

First, the amount of storage per token is fixed by the model. A given model uses a specific number of bytes per token, regardless of what the token is. The exact number depends on the model's architecture, the data type, and how attention heads are grouped. For Llama 3.1 70B at FP8, each token takes roughly 160 kilobytes of KV cache space.
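The per-token figure can be reproduced with a little arithmetic. The sketch below is illustrative: the function name is made up for this example, and the layer count, KV head count, and head dimension are assumed from the published Llama 3.1 70B configuration rather than taken from this module.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_element: int) -> int:
    """Bytes of KV cache that one token occupies across all layers."""
    # Factor of 2: one key vector and one value vector per layer per KV head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element


# Assumed model shape (not stated in this module): 80 layers,
# 8 KV heads under grouped-query attention, head dimension 128.
LLAMA_31_70B = dict(num_layers=80, num_kv_heads=8, head_dim=128)

# FP8 stores each element in 1 byte.
fp8_bytes = kv_bytes_per_token(**LLAMA_31_70B, bytes_per_element=1)
print(f"FP8 KV cache: {fp8_bytes:,} bytes per token ({fp8_bytes / 1024:.0f} KiB)")
```

Under those assumptions the result lands at roughly 160 kilobytes per token, and a 2-byte data type doubles it.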

Second, that storage lives in HBM, the same memory the model weights live in. By default there is no spilling to system RAM and no swap to disk. Every active conversation's KV cache competes for the same physical space as the weights themselves.

How it scales

If one token takes 160 kilobytes, then a 1,000-token conversation takes about 160 megabytes. A 10,000-token conversation takes 1.6 gigabytes. A 128,000-token conversation, which is what Llama 3.1 supports at maximum, takes roughly 20 gigabytes. For one user.

That last number is the one to internalize. A Llama 3.1 70B model, at FP8, with a single user holding 128,000 tokens of conversation history, uses about 20 gigabytes of GPU memory just to remember the conversation. That is more than a quarter of the size of the model itself.

Now multiply by concurrency. If five users are simultaneously holding 128K contexts, you need about 100 gigabytes of KV cache space, plus 70 gigabytes for the weights, plus a margin for activations and headroom. You have just walked past the limit of a single H100 with 80 gigabytes of memory.

This is the multiplication that traditional planning misses. The headline is the model size. The reality is the model size, plus context length times concurrent users times bytes per token.
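That relationship is short enough to write down as code. This is a minimal sketch, not a sizing tool: the function names are hypothetical, the default uses the rounded 160 KB per token from earlier in this module, and activations and headroom are left out on purpose.

```python
KV_BYTES_PER_TOKEN = 160_000  # ~160 KB per token (rounded figure from this module)
GB = 1e9                      # decimal gigabytes, matching the module's arithmetic


def kv_cache_gb(context_tokens: int, concurrent_users: int,
                bytes_per_token: int = KV_BYTES_PER_TOKEN) -> float:
    """KV cache size: context length x concurrent users x bytes per token."""
    return context_tokens * concurrent_users * bytes_per_token / GB


def memory_floor_gb(weights_gb: float, context_tokens: int, concurrent_users: int,
                    bytes_per_token: int = KV_BYTES_PER_TOKEN) -> float:
    """Weights plus KV cache. Activations and headroom come on top of this."""
    return weights_gb + kv_cache_gb(context_tokens, concurrent_users, bytes_per_token)


# Hypothetical inputs, chosen only to show how the two dimensions multiply.
base = kv_cache_gb(10_000, 8)
double_context = kv_cache_gb(20_000, 8)
double_both = kv_cache_gb(20_000, 16)
print(f"base: {base:.1f} GB, 2x context: {double_context:.1f} GB, "
      f"2x context and 2x users: {double_both:.1f} GB")
print(f"memory floor with 70 GB of weights: {memory_floor_gb(70, 10_000, 8):.1f} GB")
```

Doubling either input doubles the KV cache; doubling both quadruples it.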

A concrete budget

Picture a 70B model at FP8 serving an enterprise RAG application. Average prompt is 6,000 tokens of retrieved context plus chat history. Peak concurrent in-flight requests is 64. Each request's KV cache takes about 1 gigabyte at this context length.

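The memory budget looks like this:

Model weights (70B at FP8): ~70 GB
KV cache (64 requests at ~1 GB each): ~64 GB
Activations, runtime overhead, and headroom: ~46 GB (the remainder implied by the total)

Total: ~180 GB. Fits comfortably on two H200s (282 GB combined) or two MI300X cards. Does not fit on two H100s (160 GB combined).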

The exact same workload on the exact same model lands on different hardware depending on a single parameter you might have brushed past in the spec: the average context length.

Halve the context to 3,000 tokens and the KV cache drops to 32 GB, for a total of about 148 GB. Now the workload fits on two H100s, if only just. Double the context to 12,000 tokens and the KV cache becomes 128 GB, for a total of about 244 GB. That leaves two H200s with little margin, so you are looking at three H200s, or you need to quantize KV separately, or you need to revisit the conversation about whether 12,000 tokens of context is actually necessary for this use case.
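The sensitivity to context length can be sketched in a few lines. Two assumptions are baked in and should be read as simplifications: the KV cache scales linearly with context length, and everything that is not KV cache stays fixed at the scenario's ~180 GB total minus its KV share. The names are illustrative.

```python
BASELINE_CONTEXT = 6_000           # tokens per request in the scenario
BASELINE_KV_PER_REQUEST_GB = 1.0   # "about 1 gigabyte" at 6,000 tokens
CONCURRENT_REQUESTS = 64
BASELINE_TOTAL_GB = 180.0

# Combined capacities quoted in this module.
CAPACITY_GB = {"2x H100": 160, "2x H200": 282}


def kv_cache_gb(context_tokens: int) -> float:
    """KV cache for all in-flight requests, scaled linearly from the baseline."""
    per_request = BASELINE_KV_PER_REQUEST_GB * context_tokens / BASELINE_CONTEXT
    return CONCURRENT_REQUESTS * per_request


# Assumption: weights, activations, and headroom do not change with context.
NON_KV_GB = BASELINE_TOTAL_GB - kv_cache_gb(BASELINE_CONTEXT)


def total_gb(context_tokens: int) -> float:
    return NON_KV_GB + kv_cache_gb(context_tokens)


for tokens in (3_000, 6_000, 12_000):
    total = total_gb(tokens)
    usage = ", ".join(f"{name}: {total / cap:.0%}" for name, cap in CAPACITY_GB.items())
    print(f"{tokens:>6} tokens -> KV {kv_cache_gb(tokens):.0f} GB, "
          f"total {total:.0f} GB ({usage})")
```

The percentages show how much of each configuration the workload would occupy; how much margin counts as enough is a judgment call the sketch leaves to the reader.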

Why this is the silent killer

Most planning conversations focus on the model size. A team picks a 70B model. The IT side computes 70 GB times some factor, calls it 150 GB, and feels reasonably confident.

The KV cache is silent because nobody mentions it. The AI team rarely thinks about it explicitly because their serving engine handles it for them. The IT team has no equivalent in traditional infrastructure. So the multiplication of context length and concurrency, the thing that drives the actual memory footprint, slips through the gap.

The fix is to ask three specific questions before any sizing exercise. What is the maximum context length the application supports? What is the realistic peak concurrent in-flight request count? What precision will the KV cache be stored in? The third answer, together with the model's architecture, fixes the bytes per token. Multiply context length by concurrency by bytes per token and you have the KV budget. Add the model weights. That is your honest memory floor.

What helps

A few choices reduce KV pressure without changing the application.

Quantizing the KV cache from 16-bit precision to FP8 halves its size, often with negligible quality loss. Some serving engines offer this as a toggle; some require explicit configuration.

Some model architectures use a technique called Multi-head Latent Attention (MLA) that compresses KV by roughly ten times. DeepSeek-V3 is the best-known example. If KV pressure is the limiting factor, the choice of model architecture starts to matter more than the parameter count.

Some serving engines support offloading cold KV blocks to CPU RAM or NVMe, trading latency for capacity. This is operationally complex and usually reserved for very large deployments.

The simplest thing, when KV pressure is the limit, is to cap the context length the application accepts. Long contexts are often inherited from the model's spec without being needed by the application. Capping at 16K or 32K tokens when the model supports 128K frees up significant headroom.
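To see how much a cap or a smaller KV data type buys, it helps to turn the question around: given a fixed KV budget, how many worst-case requests fit? The sketch below is hypothetical throughout. The 64 GB budget is an example value, the per-token sizes are the rounded ~160 KB figure for FP8 and double that for 16-bit, and every request is assumed to fill its cap.

```python
def max_concurrent_requests(kv_budget_gb: float, context_cap_tokens: int,
                            kv_bytes_per_token: int) -> int:
    """Worst-case requests that fit if each one fills the context cap."""
    per_request_bytes = context_cap_tokens * kv_bytes_per_token
    return int(kv_budget_gb * 1e9 // per_request_bytes)


KV_BUDGET_GB = 64  # example budget, not a recommendation

# Rounded per-token sizes: FP8 from this module, 16-bit assumed to be double.
BYTES_PER_TOKEN = {"16-bit KV": 320_000, "FP8 KV": 160_000}

for label, bytes_per_token in BYTES_PER_TOKEN.items():
    for cap in (128_000, 32_000, 16_000):
        fits = max_concurrent_requests(KV_BUDGET_GB, cap, bytes_per_token)
        print(f"{label}, cap {cap:>7,} tokens: {fits} concurrent requests")
```

Real serving engines allocate KV in blocks and most requests stay below the cap, so actual concurrency is usually higher than this worst-case count.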

The next module names the seven parameters that drive every sizing decision. Two of them, context length and concurrency, are what you have just seen multiplied here. The other five round out the planning conversation.

Try this in the Sizer

Open the Sizer with a 128K-token context and watch the memory budget swell. Then drop max_context_tokens to 8,000 and see the recommendation collapse to fewer GPUs. This is what context length pressure looks like in practice.