aisle

Curriculum · 8 modules · ~47 min

A working knowledge of on-prem AI inference

Eight short modules, in order. They build on each other, so start at the top if you’re new to AI, or jump to a specific module if you know what you’re looking for. By the end you’ll be ready to use the Sizer with confidence and have a productive planning conversation with your AI team.

  1. What is AI inference?

    4 min read

    Inference vs training, where it sits in the AI workload landscape, and why it's the workload most enterprises will run on-prem first.

  2. Why your existing playbook breaks

    5 min read

    Four assumptions about capacity planning that don't survive contact with an inference workload.

  3. How a model actually serves a request

    7 min read

    Inside the request lifecycle: model weights in memory, prefill, decode, and why the bottleneck is data movement, not compute.

  4. The KV cache, the silent capacity killer

    6 min read

    Why each in-flight conversation needs its own working memory, and why the memory costs of context length and concurrency multiply rather than add.

  5. The seven parameters that drive sizing

    8 min read

    The full input set: model size, precision, context length, concurrent users, requests per second (RPS), latency SLOs, and burst factor.

  6. A worked example, end-to-end

    6 min read

    A regional bank's RAG chatbot: walk through inputs, the math, and the resulting infrastructure recommendation.

  7. The IT-and-AI planning conversation

    4 min read

    A one-page checklist: who owns which decision, what to ask the AI team, and what red flags to watch for.

  8. Optimization techniques

    7 min read

    The AI team's six big levers, including quantization, batching, caching, LoRA, and speculative decoding: what each does, how much it changes the math, and when to use it.
