M3 · How a model actually serves a request

There is a model. It lives inside the memory of one or more GPUs. The team has loaded its weights once, at startup, by streaming tens of gigabytes of numbers from a model registry into the GPU's high-bandwidth memory. From that moment on, the model is ready to serve.

A request lands. The user has sent a prompt, perhaps "What does our parental leave policy say about adoption?" After authentication, rate limiting, and a routing decision, that prompt arrives on a specific GPU as a sequence of tokens. Tokens are the unit of work the model thinks in. Roughly, one token is three or four characters of English. A 1,000-word question is about 1,500 tokens.

What happens next is the entire substance of running inference, and it takes place in two distinct phases. This module describes both, and explains why one of them dominates the time the user spends waiting.

Phase 1: prefill

The first thing the GPU does with a new prompt is read it. All of it, at once.

Reading a prompt is not as simple as it sounds. The model needs to compute, for every token in the prompt, a set of intermediate values that capture what that token means in this specific context. The word "leave" in "parental leave" carries a different meaning than "leave" in "what time should we leave." These intermediate values are how the model builds up its understanding of the request.

For the model, building this understanding requires running the entire prompt through every layer of the network. A 70 billion parameter model has dozens of these layers stacked on top of each other. The prompt flows through them. By the end, the model has a complete representation of the user's question and is ready to begin answering.

This phase is called prefill. Two things matter about it.

First, prefill is fast in wall-clock terms because it runs in parallel. The model can process all 1,500 tokens of a prompt at the same time, the way a chef preps every ingredient simultaneously rather than one by one. A 1,500-token prompt on a 70B model on two H200 GPUs might complete prefill in around 200 milliseconds.
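As a rough cross-check on that 200 ms figure, the sketch below estimates prefill time from compute alone. The inputs are illustrative assumptions, not numbers from this module: about 2 floating-point operations per parameter per prompt token (a common approximation that ignores attention cost), roughly 1,000 TFLOPS of dense 16-bit throughput per GPU, and 50% utilization.

```python
def prefill_seconds(params, prompt_tokens, gpus, flops_per_gpu, utilization):
    """Compute-only estimate of prefill time, in seconds."""
    # Assumption: ~2 FLOPs per parameter per token; attention cost is ignored.
    total_flops = 2 * params * prompt_tokens
    # Assumption: work splits evenly across GPUs at a fixed utilization.
    effective_flops_per_s = gpus * flops_per_gpu * utilization
    return total_flops / effective_flops_per_s


# 70B parameters, 1,500-token prompt, two GPUs.
# The 1.0e15 FLOPS per GPU and 0.5 utilization values are assumed.
estimate = prefill_seconds(70e9, 1500, 2, 1.0e15, 0.5)
print(f"{estimate * 1000:.0f} ms")  # about 210 ms under these assumptions
```

With these inputs the estimate lands close to the figure quoted above. Real prefill time depends on the kernel, the precision, and how well the work is spread across GPUs.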

Second, prefill builds up the working memory for the rest of the request. That working memory is called the KV cache, and it is the subject of the next module. For now, the important thing is that every token in the prompt contributes a small but consistent amount of memory to that working state.

Prefill is the reason you see a slight delay before the first character of an AI answer appears. That delay is called Time to First Token, often shortened to TTFT.

Phase 2: decode

Once prefill is done, the model starts generating the answer. It does this one token at a time.

This sounds inefficient, and it is, but it is how language models work. The model picks the most likely next word given everything it has seen so far. It generates that word. Then it picks the most likely next word given everything it has seen plus the word it just generated. And so on, until the response is complete.
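The loop described above can be sketched in a few lines. This is a toy illustration, not a real model: `make_toy_model`, `greedy_decode`, and the `<eos>` marker are hypothetical names, and the "model" simply replays a canned reply. Real systems also often sample from the likely tokens rather than always taking the single most likely one.

```python
from typing import Callable, List

EOS = "<eos>"  # hypothetical end-of-sequence marker


def make_toy_model(prompt_len: int, reply: List[str]) -> Callable[[List[str]], str]:
    """Stand-in for a model: replays a canned reply one token per call."""
    def next_token(context: List[str]) -> str:
        step = len(context) - prompt_len  # tokens generated so far
        return reply[step] if step < len(reply) else EOS
    return next_token


def greedy_decode(prompt: List[str],
                  next_token: Callable[[List[str]], str],
                  max_new_tokens: int = 50) -> List[str]:
    """Generate one token at a time, feeding each token back as context."""
    context = list(prompt)
    output: List[str] = []
    for _ in range(max_new_tokens):
        token = next_token(context)  # one full pass through the model
        if token == EOS:
            break
        output.append(token)
        context.append(token)        # the next step sees this token too
    return output


prompt = ["What", "about", "adoption", "?"]
model = make_toy_model(len(prompt), ["Adoption", "is", "covered", "."])
print(greedy_decode(prompt, model))
```

The point of the sketch is the shape of the loop: each new token needs its own pass through the model, and that pass cannot start until the previous token exists.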

This phase is called decode, and it has a property that drives most of the infrastructure conversation. Every single output token requires the GPU to read the entire set of model weights and the conversation's KV cache from memory, run them through the network, and write the result back.

Reading 70 gigabytes of weights for one word (a 70 billion parameter model stored at one byte per parameter; at 16-bit precision it would be twice that). Then reading 70 gigabytes again for the next word. Then again and again, until the answer is complete. The GPU's compute units are mostly idle during this phase. They are waiting for memory to deliver the data.

This is the moment to internalize the central point of this curriculum. The bottleneck during decode is not compute. It is how fast the GPU can move data from its own memory to its compute units. This metric is called memory bandwidth, and it is the single most important spec to look at when choosing a GPU for streaming inference.

An H100 delivers about 3.4 terabytes per second of memory bandwidth. An H200 delivers about 4.8. A B200 delivers about 8. The decode throughput, in tokens per second per GPU, scales roughly with this number, not with the FLOPS rating.
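To make the scaling concrete, the sketch below computes a lower bound on time per token for each GPU: the time needed just to read the weights once. It is a simplification under stated assumptions: 70 GB of weights read once per token, a single GPU, a single request, and no KV cache traffic. The function and variable names are illustrative.

```python
WEIGHT_BYTES = 70e9  # 70 GB of weights, read once per output token

# Approximate memory bandwidth in terabytes per second, from the text above.
BANDWIDTH_TB_PER_S = {"H100": 3.4, "H200": 4.8, "B200": 8.0}


def decode_floor_ms(weight_bytes, bandwidth_tb_per_s):
    """Minimum milliseconds per token if decode only had to read the weights."""
    seconds = weight_bytes / (bandwidth_tb_per_s * 1e12)
    return seconds * 1000


for gpu, bandwidth in BANDWIDTH_TB_PER_S.items():
    floor = decode_floor_ms(WEIGHT_BYTES, bandwidth)
    print(f"{gpu}: at least {floor:.1f} ms per token, "
          f"at most {1000 / floor:.0f} tokens/s for one request")
```

The floor drops in direct proportion to bandwidth, which is the relationship the paragraph above describes. Measured numbers will be worse than the floor, and serving many requests in one batch changes the per-GPU throughput picture.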

The time between consecutive output tokens is called Time per Output Token, or TPOT. For a streaming chatbot to feel responsive, you want TPOT below 50 milliseconds, ideally below 30.

A complete request, end to end

A request from the user's keyboard to the streamed response looks like this:

Figure: A single request, from authentication to streamed response. Prefill happens once, in parallel. Decode happens once per output token, sequentially.

The total wall-clock time the user waits is TTFT plus the number of output tokens multiplied by TPOT. A 300-token answer at 30 ms per token plus a 200 ms prefill takes about 9.2 seconds. Streamed back token by token, this feels natural. Delivered as a single response at the end, it would feel slow.
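The arithmetic above fits in one small function. This is a minimal sketch with an illustrative function name; it uses the same simplification as the text, counting one TPOT interval for every output token.

```python
def total_latency_s(ttft_ms, output_tokens, tpot_ms):
    """Wall-clock seconds for one request: prefill plus sequential decode."""
    return (ttft_ms + output_tokens * tpot_ms) / 1000


# The example from the text: 200 ms prefill, 300 tokens at 30 ms each.
print(total_latency_s(200, 300, 30))  # 9.2
```

Changing one input at a time shows where the wait comes from. With these numbers, decode accounts for 9.0 of the 9.2 seconds.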

Now you can see why the same model can feel fast or slow depending on the hardware and the request. If the GPU has plenty of compute but limited memory bandwidth, prefill is quick but decode crawls. If the GPU has plenty of memory bandwidth and the model is small, decode is quick and the prefill of long prompts becomes the bottleneck. The infrastructure choices, the model choices, and the request shape all interact.

What this means for planning

Three things follow from understanding prefill and decode separately.

First, the GPU you choose matters for a specific reason: memory bandwidth, not FLOPS. For LLM inference, the answer to "is this GPU enough" is mostly about the bandwidth number, not the marketing throughput number.

Second, the prompt shape matters. Long prompts cost real prefill time. A workload that sends 10,000-token prompts behaves very differently from one that sends 500-token prompts, even at the same user count.

Third, the response shape matters. Long responses spend most of their time in decode, which is bandwidth-bound. Short responses to long prompts are dominated by prefill, which is compute-bound. The optimal hardware shifts depending on which one dominates your workload.
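One way to see which phase dominates a workload is to compute the share of request time spent in decode. The sketch below reuses this module's example figures (200 ms for a 1,500-token prefill, 30 ms per output token) and assumes prefill time grows linearly with prompt length. That linearity is a simplification for illustration, and the names are hypothetical.

```python
# Assumption: prefill cost scales linearly from the 200 ms / 1,500-token example.
PREFILL_MS_PER_TOKEN = 200 / 1500
TPOT_MS = 30  # example time per output token


def decode_share(prompt_tokens, output_tokens):
    """Fraction of total request time spent in decode (0.0 to 1.0)."""
    prefill_ms = prompt_tokens * PREFILL_MS_PER_TOKEN
    decode_ms = output_tokens * TPOT_MS
    return decode_ms / (prefill_ms + decode_ms)


# A short prompt with a long answer versus a long prompt with a short answer.
print(f"{decode_share(500, 300):.0%} of time in decode")
print(f"{decode_share(10_000, 20):.0%} of time in decode")
```

Under these assumptions the first workload is almost entirely decode, while the second spends most of its time in prefill.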

The next module covers the KV cache. It is what gets built during prefill, what gets read on every decode step, and the single biggest factor in how many concurrent users your infrastructure can serve.

Try this in the Sizer

Try tightening the TPOT target from 40 ms to 20 ms and watch the engine step up to a higher-bandwidth GPU. That step-up is decode bandwidth pressure showing up in the planning math.
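The kind of step-up described here can be imitated with a very small selection rule. This is not the Sizer's actual logic, which this module does not describe. It is a hypothetical sketch that picks the first GPU whose weight-read time fits under the TPOT target, assuming 70 GB of weights on a single GPU and the bandwidth figures quoted earlier.

```python
# Approximate bandwidth in TB/s, ordered from lowest to highest.
GPUS = [("H100", 3.4), ("H200", 4.8), ("B200", 8.0)]


def smallest_gpu_for(tpot_target_ms, weight_gb=70):
    """Return the first GPU whose bandwidth floor meets the target, else None."""
    for name, tb_per_s in GPUS:
        # GB divided by TB/s gives milliseconds: (GB / (1000 GB/s)) * 1000 ms/s.
        floor_ms = weight_gb / tb_per_s
        if floor_ms <= tpot_target_ms:
            return name
    return None  # no listed GPU can meet the target on its own


print(smallest_gpu_for(40))  # H100
print(smallest_gpu_for(20))  # H200
```

In this toy version the H100 needs just over 20 ms to read 70 GB, so a 20 ms target pushes the choice to the next GPU up.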