Curriculum · 8 modules · ~47 min
A working knowledge of on-prem AI inference
Eight short modules, in order. They build on each other, so start at the top if you’re new to AI, or jump to a specific module if you know what you’re looking for. By the end you’ll be ready to use the Sizer with confidence and have a productive planning conversation with your AI team.
What is AI inference?
4 min read · Inference vs training, where it sits in the AI workload landscape, and why it's the workload most enterprises will run on-prem first.
Why your existing playbook breaks
5 min read · Four assumptions about capacity planning that don't survive contact with an inference workload.
How a model actually serves a request
7 min read · Inside the request lifecycle: model weights in memory, prefill, decode, and why the bottleneck is data movement, not compute.
The KV cache, the silent capacity killer
6 min read · Why each in-flight conversation needs its own working memory, and why context length and concurrency multiply each other's memory cost.
The seven parameters that drive sizing
8 min read · The full input set: model size, precision, context length, concurrent users, requests per second (RPS), latency SLOs, and burst factor.
A worked example, end-to-end
6 min read · A regional bank's RAG chatbot: walk through the inputs, the math, and the resulting infrastructure recommendation.
The IT-and-AI planning conversation
4 min read · A one-page checklist: who owns which decision, what to ask the AI team, and what red flags to watch for.
Optimization techniques
7 min read · The AI team's six big levers, including quantization, batching, caching, LoRA, and speculative decoding: what each does, how much it changes the math, and when to use it.
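As a quick preview of why context length and concurrency multiply in the KV cache module, here is a minimal sketch of the standard KV-cache estimate for a decoder-only transformer. The function name and the model shape (32 layers, 8 KV heads, head dimension 128, 16-bit values) are illustrative assumptions, not figures from the curriculum or the Sizer.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, bytes_per_value,
                   context_tokens, concurrent_sequences):
    """Rough KV-cache size in bytes for a decoder-only transformer."""
    # Each token stores one key and one value vector per layer per KV head.
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    # Every in-flight sequence holds its own cache for its full context.
    return per_token * context_tokens * concurrent_sequences

# Hypothetical model shape, chosen only for illustration.
SHAPE = dict(layers=32, kv_heads=8, head_dim=128, bytes_per_value=2)

base = kv_cache_bytes(**SHAPE, context_tokens=8192, concurrent_sequences=16)
doubled = kv_cache_bytes(**SHAPE, context_tokens=16384, concurrent_sequences=32)

print(f"base:    {base / 2**30:.0f} GiB")     # base:    16 GiB
print(f"doubled: {doubled / 2**30:.0f} GiB")  # doubled: 64 GiB
```

Doubling both context length and concurrency quadruples the cache, which is why the two inputs cannot be sized independently. Real serving stacks may use paging, quantized caches, or prefix sharing, so treat this as an upper-bound sketch rather than a sizing result.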