Module 8 · 7 min read

Optimization techniques

What you’ll learn: Identify the six major LLM inference optimization techniques, understand the throughput and memory impact of each, and know when each one is worth adopting in a deployment.

The previous seven modules have given you the planning vocabulary. M7, the checklist, is about who owns which decision. This module is about a specific category of decisions that lives on the AI team's side of the conversation but reshapes the IT-side math: optimization techniques. These are the levers the AI team can pull that don't change what the model does, but change how much hardware it takes to run.

There are six techniques worth knowing by name. None of them are mysterious once you see the picture. All of them appear as inputs in the Sizer's Advanced step, and all of them are choices the AI team is making on your behalf, whether you know it or not.

Quantization

What it is. The model's weights are normally stored at the precision they were trained in, usually FP16 or BF16 (16 bits per number). Quantization stores them at lower precision: FP8 (8 bits), INT8 (8-bit integer), or INT4 (4 bits). Fewer bits per parameter means less memory and less data to move per token.

Impact on the math. FP8 cuts weight memory and decode bandwidth roughly in half compared to BF16, with negligible quality loss for most production models. INT8 has the same 8-bit footprint as FP8, so it saves no additional memory; it matters mainly on hardware that lacks FP8 support. INT4 packs four times tighter than FP16 but introduces noticeable quality loss on reasoning-heavy tasks. The 2026 production baseline is FP8: it roughly doubles your effective throughput per GPU, with quality that is indistinguishable from BF16 on chatbot and RAG workloads.

When to use it. Always start with FP8 if your hardware supports it (H100 and newer, MI300X and newer). Fall back to INT8 only when your GPUs lack FP8 support (A100-class, for example) and the AI team has validated quality on your workload. INT4 is reserved for memory-constrained edge deployments or when you are comfortable losing a few percentage points of accuracy.
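
The memory side of quantization is simple arithmetic: parameters times bits per parameter. The sketch below works that out for an illustrative 70B-parameter model. The function name is hypothetical, and the calculation covers weights only; it ignores quantization scales, KV cache, and activation memory, so treat the numbers as a lower bound.

```python
# Bits per parameter for the precisions named in this module.
BITS_PER_PARAM = {"BF16": 16, "FP16": 16, "FP8": 8, "INT8": 8, "INT4": 4}


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Approximate weight memory in GB (10^9 bytes).

    Assumption: weights only, no scales, KV cache, or activations.
    """
    return num_params * BITS_PER_PARAM[precision] / 8 / 1e9


params = 70e9  # illustrative 70B-parameter model
for precision in ("BF16", "FP8", "INT8", "INT4"):
    print(f"{precision}: {weight_memory_gb(params, precision):.0f} GB")
```

Note that FP8 and INT8 come out identical here. The choice between them is about hardware support and quality, not footprint.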

Continuous batching

What it is. Without continuous batching, an inference server runs requests in static batches: collect N requests, send them through the model together, return all the responses, repeat. The problem is that responses finish at different times. The first request might be done after 10 tokens while another needs 500. The fast ones wait for the slow ones, and the GPU sits partially idle. Continuous batching breaks this by replacing each finished request with a new one immediately, keeping the batch always full.

Impact on the math. The published gain is dramatic: roughly 4x more concurrent requests served on the same GPU compared to naive static batching. The GPU goes from ~30% utilized to nearly fully utilized. This is the single biggest software optimization on the table.
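
To see where the gain comes from, here is a toy simulation that counts decode steps for the same queue of requests under both policies. It is a sketch under strong assumptions: every decode step costs the same regardless of batch occupancy, prefill is ignored, and the function names and request lengths are made up for illustration. It does not reproduce the published 4x figure; it only shows the mechanism.

```python
from collections import deque


def static_batching_steps(output_lens, batch_size):
    """Decode steps when each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(output_lens), batch_size):
        steps += max(output_lens[i:i + batch_size])
    return steps


def continuous_batching_steps(output_lens, batch_size):
    """Decode steps when a finished request's slot is refilled immediately.

    Assumes every request needs at least one output token.
    """
    queue = deque(output_lens)
    active = []  # remaining tokens for each in-flight request
    steps = 0
    while queue or active:
        # Refill free slots before the next decode step.
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        steps += 1
        # Every active request emits one token; drop the ones that are done.
        active = [r - 1 for r in active if r > 1]
    return steps


def utilization(output_lens, batch_size, steps):
    """Fraction of slot-steps that produced a useful token."""
    return sum(output_lens) / (batch_size * steps)


# Hypothetical output lengths (tokens) for eight queued requests.
lens = [10, 500, 20, 30, 15, 480, 25, 40]
for name, fn in (("static", static_batching_steps),
                 ("continuous", continuous_batching_steps)):
    steps = fn(lens, batch_size=4)
    print(f"{name}: {steps} steps, {utilization(lens, 4, steps):.0%} slot utilization")
```

On this toy queue, static batching needs 980 decode steps and continuous batching needs 500, because the short requests no longer hold a slot hostage while the long ones finish.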

When to use it. Always. Every modern serving engine (vLLM, TensorRT-LLM, SGLang, TGI) does this by default. The only reason you might not be using it is if the AI team is running a custom Python loop, which is a red flag. Ask whether they are using a proper serving engine. The answer should be yes.

PagedAttention

What it is. The KV cache is allocated in fixed-size pages, the way an operating system manages RAM. Without this, the serving engine pre-allocates the maximum possible KV cache size for each request, even when the actual conversation is short. Most of that allocated memory sits unused. PagedAttention allocates only the pages a request actually needs, packed densely.

Impact on the math. Memory waste drops from 60-80% in naive systems to under 4% with PagedAttention. The practical effect is that you fit roughly four times more concurrent requests in the same GPU memory.
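
The waste figures can be sanity-checked with a short calculation. The sketch below compares reserving the maximum context length for every request against allocating fixed-size pages on demand. The sequence lengths and the 16-token page size are assumptions chosen for illustration, and waste is counted in token slots rather than bytes.

```python
import math


def preallocated_waste(seq_lens, max_len):
    """Fraction of reserved KV slots left unused when each request reserves max_len."""
    reserved = max_len * len(seq_lens)
    return 1 - sum(seq_lens) / reserved


def paged_waste(seq_lens, page_size=16):
    """Fraction of reserved KV slots left unused when pages are allocated on demand.

    Only the last, partially filled page of each request is wasted.
    """
    reserved = sum(math.ceil(n / page_size) * page_size for n in seq_lens)
    return 1 - sum(seq_lens) / reserved


# Hypothetical actual sequence lengths against a 2048-token maximum.
seq_lens = [120, 350, 900, 60, 2048, 500]
print(f"pre-allocated waste: {preallocated_waste(seq_lens, 2048):.1%}")
print(f"paged waste:         {paged_waste(seq_lens):.1%}")
```

With these made-up lengths, pre-allocation wastes about two thirds of the reserved slots, while paging wastes about 1%.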

When to use it. Always, and it pairs with continuous batching. vLLM invented PagedAttention; TensorRT-LLM, SGLang, and others have adopted it. Same red flag as above: if the AI team is not using a serving engine with PagedAttention, you are leaving capacity on the table.

Speculative decoding

What it is. A small "draft" model runs ahead of the main model, proposing several tokens at once. The main model verifies all the proposed tokens in a single forward pass. When the draft model's guesses are correct (which is most of the time for predictable text), the main model effectively generates multiple tokens per pass.

Impact on the math. Real-world speedup is 1.5x to 3x for chat workloads, depending on how often the draft model is right (the "acceptance rate"). Combined with knowledge distillation, the speedup can reach 6x or higher. The cost is a small amount of GPU memory for the draft model and some engineering complexity.
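
A rough way to connect acceptance rate to speedup is the expected number of tokens the main model emits per verification pass. The sketch below uses a common simplification: each drafted token is accepted independently with the same probability, and the main model always contributes one token of its own per pass. The function name is hypothetical, and the result is an upper bound on speedup because it ignores the time spent running the draft model.

```python
def expected_tokens_per_pass(acceptance_rate: float, draft_len: int) -> float:
    """Expected tokens produced per main-model pass.

    Assumption: each of the draft_len proposed tokens is accepted
    independently with probability acceptance_rate, and verification stops
    at the first rejection. The main model then supplies one token itself,
    giving the geometric sum 1 + a + a^2 + ... + a^draft_len.
    """
    a = acceptance_rate
    if a >= 1.0:
        return float(draft_len + 1)
    return (1 - a ** (draft_len + 1)) / (1 - a)


for rate in (0.5, 0.7, 0.8, 0.9):
    tokens = expected_tokens_per_pass(rate, draft_len=4)
    print(f"acceptance {rate:.0%}: {tokens:.2f} tokens per pass")
```

The curve is steep: a draft model that is right 90% of the time is worth far more than one that is right 70% of the time, which is why the acceptance rate is the number to ask the AI team about.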

When to use it. When TPOT is your binding constraint and you have already optimized everything else. Common in latency-sensitive chat deployments, less common in batch or RAG workloads where TPOT is not the bottleneck. Ask the AI team whether speculative decoding is in scope. The answer changes the per-replica throughput estimate by 2x or more.

LoRA (multi-tenant adapters)

What it is. Most fine-tuning today does not update the model's full weights. Instead, it trains a small "adapter" (LoRA stands for Low-Rank Adaptation) that gets added to the base model at inference time. The base model is shared; each tenant or use case has its own small adapter. The adapters are tiny, often a few hundred megabytes, against a base model of tens of gigabytes.

Impact on the math. Without LoRA, you run one base model deployment per fine-tune, which means N replicas for N variants. With LoRA, you run one base model deployment and dynamically load the right adapter per request. Your fleet shrinks from N to 1.
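
The arithmetic behind that claim can be sketched in a few lines. The first function counts adapter parameters for one adapted weight matrix, using the standard LoRA shape of two low-rank factors. The second compares fleet weight memory with and without shared base weights. All sizes are illustrative assumptions, the function names are hypothetical, and the comparison covers weights only: it ignores KV cache and any extra replicas needed to carry traffic.

```python
def lora_adapter_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in one LoRA adapter pair: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


def fleet_weight_memory_gb(base_gb, num_variants, adapter_gb=None):
    """Total weight memory across all fine-tuned variants.

    adapter_gb=None models one full copy of the base model per variant.
    Otherwise one shared base model plus one adapter per variant.
    """
    if adapter_gb is None:
        return base_gb * num_variants
    return base_gb + adapter_gb * num_variants


# One 4096 x 4096 projection: full fine-tune vs. rank-16 adapter.
print("full matrix params:", 4096 * 4096)
print("rank-16 adapter params:", lora_adapter_params(4096, 4096, 16))

# Hypothetical fleet: 70 GB base model, 0.3 GB adapters, 8 variants.
print("separate deployments:", fleet_weight_memory_gb(70, 8), "GB")
print("shared base + adapters:", round(fleet_weight_memory_gb(70, 8, 0.3), 1), "GB")
```

The shared deployment still has to be sized for the combined traffic of all variants, so the fleet shrinks to one deployment, not necessarily to one replica.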

When to use it. Whenever your organization will run multiple fine-tuned variants of the same base model. One adapter per business unit, one per product line, one per language. LoRA collapses what would have been many separate deployments into one shared cluster, often saving 5x or more on hardware.

Prefix caching

What it is. Many requests start with the same prompt. RAG applications inject the same system prompt and retrieval template into every call. Agentic workflows repeat the same tool definitions. The prefill work to process this shared prefix is identical across requests. Prefix caching stores the KV cache for these common prefixes and reuses it instead of recomputing.

Impact on the math. For prefix-heavy workloads (RAG, agents), prefill cost drops by 40-70%. TTFT improves correspondingly. Memory usage goes up slightly because you are keeping the prefix's KV cache resident, but the trade is almost always worth it.
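
A toy model makes the saving concrete. The sketch below counts how many prompt tokens need prefill compute across a stream of requests, with and without reusing a previously seen prefix. It is a deliberate simplification: real serving engines typically match prefixes at block granularity and evict entries under memory pressure, and the token counts here are invented for illustration.

```python
def prefill_tokens(requests, use_cache: bool) -> int:
    """Count prompt tokens that need prefill compute.

    requests is a list of (prefix_tokens, suffix_tokens) pairs.
    Assumption: a prefix is reusable only on an exact match, and the
    cache never evicts.
    """
    cached_prefixes = set()
    computed = 0
    for prefix, suffix in requests:
        key = tuple(prefix)
        if use_cache and key in cached_prefixes:
            computed += len(suffix)  # prefix KV cache is reused
        else:
            computed += len(prefix) + len(suffix)
            if use_cache:
                cached_prefixes.add(key)
    return computed


# Hypothetical RAG workload: a 600-token shared system prompt and
# five requests with 400 tokens of unique content each.
system_prompt = list(range(600))
requests = [(system_prompt, list(range(1000 * i, 1000 * i + 400)))
            for i in range(1, 6)]

without = prefill_tokens(requests, use_cache=False)
with_cache = prefill_tokens(requests, use_cache=True)
print(f"prefill tokens without cache: {without}")
print(f"prefill tokens with cache:    {with_cache}")
print(f"reduction: {1 - with_cache / without:.0%}")
```

In this example the reduction is 48%. The saving grows with the share of each prompt that is common and with the number of requests that hit the cache, which is why the hit rate is the figure worth tracking.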

When to use it. Whenever your workload has substantial shared prefix. RAG applications: yes. Agentic frameworks: yes. Pure chat with diverse system prompts: less benefit. Ask the AI team if they have enabled prefix caching and what the cache hit rate looks like in practice. A low hit rate means the application's prompts are more diverse than expected, which is worth investigating.

A few honorable mentions

These are real techniques that show up in production but are not yet first-class in the Sizer:

  • KV cache offload moves cold KV cache pages to CPU RAM or local NVMe, freeing HBM for active conversations. Useful at very large scale, operationally complex.
  • Disaggregated prefill and decode runs prefill on one pool of GPUs and decode on another, letting each pool be optimized independently. Used by Meta, LinkedIn, Mistral. Worth knowing about but Phase 2 territory for most enterprises.
  • Speculative decoding with knowledge distillation pushes the speedup further by training the draft model specifically to mimic the main model. Active research area; production deployments are emerging.

What this means for the planning conversation

When you have your standing meeting with the AI team, four of these techniques are worth confirming explicitly. Ask about precision (always). Ask about continuous batching and PagedAttention (red flag if not enabled). Ask about prefix caching (especially for RAG and agent workloads). Speculative decoding and LoRA, along with KV cache offload from the honorable mentions, come up situationally. The AI team will mention them when they apply.

Treat these techniques as part of the planning baseline. A 70B model on H200 at FP8 with continuous batching and prefix caching needs very different infrastructure from a 70B model on H100 at BF16 without prefix caching. Same model, same hardware vendor, very different replica count.

That is the full optimization vocabulary. Combined with the seven sizing parameters from M5 and the IT-and-AI checklist from M7, you have everything you need to plan honestly.

Try this in the Sizer

Toggle speculative decoding on in the Advanced step and watch the replica count drop. That single optimization can halve your hardware budget for chat workloads.