Curriculum · 8 modules · ~47 min

A working knowledge of on-prem AI inference

Eight short modules, in order. They build on each other, so start at the top if you’re new to AI, or jump to a specific module if you know what you’re looking for. By the end you’ll be ready to use the Sizer with confidence and have a productive planning conversation with your AI team.

What is AI inference?
4 min read
Inference vs training, where it sits in the AI workload landscape, and why it's the workload most enterprises will run on-prem first.
Why your existing playbook breaks
5 min read
Four assumptions about capacity planning that don't survive contact with an inference workload.
How a model actually serves a request
7 min read
Inside the request lifecycle: model weights in memory, prefill, decode, and why the bottleneck is data movement, not compute.
The KV cache, the silent capacity killer
6 min read
Why each in-flight conversation needs its own working memory, and why context length and concurrency multiply.
The seven parameters that drive sizing
8 min read
The full input set: model size, precision, context length, concurrent users, requests per second (RPS), latency SLOs, and burst factor.
A worked example, end-to-end
6 min read
A regional bank's RAG chatbot: a walk-through of the inputs, the math, and the resulting infrastructure recommendation.
The IT-and-AI planning conversation
4 min read
A one-page checklist: who owns which decision, what to ask the AI team, and what red flags to watch for.
Optimization techniques
7 min read
The AI team's six big levers, including quantization, batching, caching, LoRA, and speculative decoding: what each does, how much it changes the math, and when to use it.
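
As a preview of the arithmetic the modules on model memory and the KV cache build toward, here is a minimal sketch in Python. It assumes a standard transformer that stores one key and one value vector per layer for every token, with a 16-bit cache. The function names and the model shape (8 billion parameters, 32 layers, 8 KV heads, head dimension 128) are illustrative assumptions, not values taken from the Sizer.

```python
def weight_bytes(num_params: int, bytes_per_param: int) -> int:
    """Memory for the model weights: parameter count times precision."""
    return num_params * bytes_per_param


def kv_cache_bytes(
    layers: int,
    kv_heads: int,
    head_dim: int,
    context_tokens: int,
    concurrent_requests: int,
    bytes_per_value: int = 2,  # assumption: 16-bit cache entries
) -> int:
    """KV cache size for a batch of in-flight requests.

    Each token stores one key and one value vector per layer (the factor 2),
    and every in-flight request keeps its own cache, so context length and
    concurrency multiply.
    """
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    return per_token * context_tokens * concurrent_requests


# Hypothetical model shape, used only for illustration.
weights = weight_bytes(num_params=8_000_000_000, bytes_per_param=2)
cache = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128,
                       context_tokens=4096, concurrent_requests=16)
cache_doubled = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128,
                               context_tokens=8192, concurrent_requests=32)

print(f"weights:  {weights / 1e9:.1f} GB")
print(f"KV cache: {cache / 1e9:.1f} GB")
print(f"KV cache with 2x context and 2x concurrency: {cache_doubled / 1e9:.1f} GB")
```

With these illustrative numbers the weights take 16 GB and the cache about 8.6 GB; doubling both context length and concurrency quadruples the cache, while the weight footprint stays fixed.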