Reference · worked example 1

Enterprise inferencing: a customer-facing assistant

A large bank stands up an AI assistant for its retail customers. The assistant answers product questions, retrieves account context (with consent), and hands off to a human when uncertain. It runs entirely on-prem because of regulatory constraints on customer data. This is what the deployment looks like, layer by layer.

Workload assumptions

Customer-facing assistant, on-prem, multi-rack

Daily active users: 5,000+
Peak concurrent in-flight: 200 requests
Model: Llama 3.1-70B
Precision: FP8 (E4M3)
Average prompt: 4,000 tokens (RAG-heavy)
Average output: 300 tokens
TTFT target: < 500 ms p95
TPOT target: < 40 ms p95
Redundancy: N+2 (mission-critical)
Burst factor: 2x
Deployment: On-prem, multi-rack
Serving engine: vLLM + Dynamo router
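
The assumptions above imply a few load figures worth having on hand before reading the topology. The following is a minimal sketch, not the Sizer's calc engine: it assumes every request runs exactly at the TTFT and TPOT targets (which are p95 figures, so this is a pessimistic steady state) and that the burst factor multiplies peak concurrency. The function names are illustrative.

```python
# Back-of-envelope load figures derived from the workload assumptions.
PEAK_CONCURRENT = 200      # peak in-flight requests
BURST_FACTOR = 2.0         # assumed to multiply peak concurrency
OUTPUT_TOKENS = 300        # average output length
TTFT_S = 0.500             # time-to-first-token target (p95), seconds
TPOT_S = 0.040             # time-per-output-token target (p95), seconds


def request_latency_s(output_tokens: int, ttft_s: float, tpot_s: float) -> float:
    """End-to-end latency if a request runs exactly at the TTFT and TPOT targets."""
    # The first token arrives at TTFT; each remaining token costs one TPOT.
    return ttft_s + (output_tokens - 1) * tpot_s


def decode_tokens_per_s(concurrent: float, tpot_s: float) -> float:
    """Aggregate decode rate when every in-flight request emits one token per TPOT."""
    return concurrent / tpot_s


latency = request_latency_s(OUTPUT_TOKENS, TTFT_S, TPOT_S)
burst_concurrent = PEAK_CONCURRENT * BURST_FACTOR

print(f"latency at target: {latency:.2f} s")
# Little's law: throughput = concurrency / time in system.
print(f"completions/s at peak: {PEAK_CONCURRENT / latency:.1f}")
print(f"decode tokens/s at peak: {decode_tokens_per_s(PEAK_CONCURRENT, TPOT_S):.0f}")
print(f"decode tokens/s at burst: {decode_tokens_per_s(burst_concurrent, TPOT_S):.0f}")
```

Under these assumptions a request takes about 12.5 seconds end to end, so 200 in-flight requests correspond to roughly 16 completions per second.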

Deployment topology

Six 8-GPU HGX H200 nodes sit behind a pool of two inference-router replicas. An API gateway in front handles rate limiting and auth. The back-end fabric is InfiniBand NDR. Liquid cooling is required at the rack tier.

Diagram: six 8-GPU HGX H200 nodes, paired inference routers, API gateway, and a parallel file system for the model registry.

Why this configuration

Why H200 over H100 or B200

The H200 carries 141 GB of HBM3e per GPU at 4.8 TB/s. That extra memory headroom (versus the H100’s 80 GB) is what lets a 70B model at FP8, roughly 70 GB of weights, fit comfortably on a single GPU and still leave room for the KV cache of long, RAG-heavy prompts. The bandwidth uplift accelerates decode, which is memory-bandwidth-bound. The B200 is the obvious next-generation choice and adds native FP4, but in mid-2026 it is more expensive per GPU and requires a stricter cooling tier for the liquid-only SKU. For a workload at this scale, the H200 is the sweet spot between capability and supply.
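
To put a number on the KV-cache headroom, here is a rough sketch that counts how many average-length requests fit beside the weights on one GPU. It uses the published Llama 3.1 70B attention shape (80 layers, 8 KV heads, head dimension 128). The KV-cache precision and the `reserve_fraction` knob are assumptions for illustration; a real serving engine also reserves memory for activations and its own overhead, so treat the results as upper bounds.

```python
# Llama 3.1 70B attention shape (published model config).
NUM_LAYERS = 80
NUM_KV_HEADS = 8        # grouped-query attention
HEAD_DIM = 128

HBM_BYTES = 141e9       # H200 capacity
WEIGHT_BYTES = 70e9     # about 70B parameters at 1 byte each (FP8)
TOKENS_PER_REQUEST = 4_000 + 300   # average prompt + average output


def kv_bytes_per_token(bytes_per_element: int) -> int:
    """Bytes of K and V stored per token across all layers."""
    return 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * bytes_per_element


def requests_that_fit(bytes_per_element: int, reserve_fraction: float = 0.0) -> int:
    """Full-length requests whose KV cache fits beside the weights on one GPU.

    reserve_fraction is a hypothetical knob for activations and runtime overhead.
    """
    usable = HBM_BYTES * (1.0 - reserve_fraction) - WEIGHT_BYTES
    per_request = kv_bytes_per_token(bytes_per_element) * TOKENS_PER_REQUEST
    return int(usable // per_request)


print("KV bytes per token (16-bit):", kv_bytes_per_token(2))
print("requests per GPU, 16-bit KV:", requests_that_fit(2))
print("requests per GPU, 8-bit KV:", requests_that_fit(1))
print("requests per GPU, 16-bit KV, 10% reserved:", requests_that_fit(2, 0.10))
```

With a 16-bit KV cache each token costs 320 KiB, so an average request holds about 1.4 GB of cache. On an 80 GB GPU the same weights would leave only about 10 GB for cache, which is the headroom argument in concrete terms.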

Why InfiniBand NDR on the back end

With six nodes (48 GPUs) and a router fleet that may shift KV fragments across replicas (Dynamo KV Block Manager pattern), the back-end fabric needs low-latency RDMA at 400 Gb/s per port. InfiniBand NDR with ConnectX-7 is the production-ready answer. RoCE over 400 GbE would work and may be cheaper, but it adds PFC/ECN tuning to the operational burden without any gain in headline bandwidth.
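
A quick way to see why 400 Gb/s matters is to estimate how long it takes to move one prompt's KV cache between nodes. This sketch assumes a 16-bit KV cache for Llama 3.1 70B (80 layers, 8 KV heads, head dimension 128) and a single port running at the given fraction of line rate. The `efficiency` parameter is a hypothetical stand-in for protocol overhead, and the 100 Gb/s comparison is illustrative only.

```python
PROMPT_TOKENS = 4_000
# K and V, 80 layers, 8 KV heads, head_dim 128, 2 bytes per element (assumed 16-bit cache).
KV_BYTES_PER_TOKEN = 2 * 80 * 8 * 128 * 2
TTFT_TARGET_MS = 500.0


def transfer_ms(num_bytes: float, port_gbps: float, efficiency: float = 1.0) -> float:
    """Milliseconds to move num_bytes over one port at the given line-rate fraction."""
    bytes_per_s = port_gbps * 1e9 / 8 * efficiency
    return num_bytes / bytes_per_s * 1e3


prompt_kv_bytes = PROMPT_TOKENS * KV_BYTES_PER_TOKEN

for gbps in (100, 400):
    ms = transfer_ms(prompt_kv_bytes, gbps)
    share = ms / TTFT_TARGET_MS
    print(f"{gbps} Gb/s: {ms:.1f} ms ({share:.0%} of the TTFT budget)")
```

At full line rate, a 4,000-token prompt's cache (about 1.3 GB) crosses a 400 Gb/s port in roughly 26 ms, a small slice of the 500 ms TTFT budget. At a quarter of that rate the same move takes four times as long.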

Why N+2 redundancy

Customer-facing means an outage is a public incident. N+2 means we can take one node out for maintenance and still tolerate an unplanned failure during the same window. With four nodes sized for peak and two spare, routine patching does not require an off-hours maintenance window.
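
The N+2 argument reduces to a simple capacity check, sketched below. It assumes capacity scales linearly with node count and that four of the six nodes carry peak load, as stated above. The function names are illustrative.

```python
TOTAL_NODES = 6       # deployed nodes, from the topology
NODES_FOR_PEAK = 4    # nodes sized to carry peak load


def serving_nodes(total: int, in_maintenance: int, failed: int) -> int:
    """Nodes still serving traffic after planned and unplanned removals."""
    return max(total - in_maintenance - failed, 0)


def meets_peak(in_maintenance: int, failed: int) -> bool:
    """True if the remaining nodes still cover peak load."""
    return serving_nodes(TOTAL_NODES, in_maintenance, failed) >= NODES_FOR_PEAK


for maint, failed in [(0, 0), (1, 0), (1, 1), (1, 2)]:
    status = "ok" if meets_peak(maint, failed) else "degraded"
    print(f"maintenance={maint} failed={failed} -> {status}")
```

One node in maintenance plus one unplanned failure still leaves four serving nodes. A second failure during the same window is the first scenario that drops below peak capacity.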

Why direct liquid cooling

An 8-way H200 chassis pulls roughly 11 kW sustained. Two per rack puts the rack at about 22 kW before networking and ancillaries. Three per rack is about 33 kW, comfortably past the 30 kW mark where high-density air starts to struggle. Direct-to-chip liquid (warm-water DLC at 30-45 °C supply) handles this density without drama and enables free cooling in most climates. RDHX would be acceptable for the two-chassis tier but loses headroom if the workload grows.
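
The rack-density arithmetic can be sketched as follows. The 11 kW chassis figure and the 30 kW air threshold come from the paragraph above; `ancillary_kw` is a hypothetical allowance for switches and other rack equipment, left at zero by default.

```python
CHASSIS_KW = 11.0       # sustained draw per 8-GPU chassis
AIR_STRAIN_KW = 30.0    # density where high-density air starts to struggle


def rack_it_kw(chassis_per_rack: int, ancillary_kw: float = 0.0) -> float:
    """IT load for one rack; ancillary_kw is a hypothetical extra for networking."""
    return chassis_per_rack * CHASSIS_KW + ancillary_kw


def past_air_limit(chassis_per_rack: int, ancillary_kw: float = 0.0) -> bool:
    """True if the rack exceeds the density where air cooling is strained."""
    return rack_it_kw(chassis_per_rack, ancillary_kw) > AIR_STRAIN_KW


for n in (1, 2, 3):
    flag = "past air limit" if past_air_limit(n) else "within air limit"
    print(f"{n} chassis per rack: {rack_it_kw(n):.0f} kW ({flag})")
```

Adding a nonzero `ancillary_kw` shows how little margin the two-chassis rack has before it reaches the same threshold.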

Why a parallel file system

When a node fails and a replacement spins up, the new node needs to load roughly 70 GB of FP8 weights as fast as possible. A parallel file system (WEKA, VAST, or equivalent) lets several replicas pull weights in parallel without serializing on a single NFS server. Local NVMe inside each node serves as the KV-cache spill tier (Dynamo KV Block Manager).
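
To illustrate the serialization point, the sketch below models weight-load time when several nodes pull the model at once. Each client is limited by its own link and by an equal share of the storage system's aggregate bandwidth. All throughput figures in the example calls are hypothetical placeholders, not vendor specifications.

```python
WEIGHT_GB = 70.0   # FP8 weights to load per node


def load_seconds(weight_gb: float, nodes_loading: int,
                 per_client_gbs: float, aggregate_gbs: float) -> float:
    """Seconds for each node to load the weights when nodes_loading pull at once.

    per_client_gbs: what one client can read on its own (GB/s, hypothetical).
    aggregate_gbs:  what the storage system can serve in total (GB/s, hypothetical).
    """
    effective = min(per_client_gbs, aggregate_gbs / nodes_loading)
    return weight_gb / effective


# Single file server: every client shares one 10 GB/s pipe (illustrative figure).
single_server = load_seconds(WEIGHT_GB, nodes_loading=2, per_client_gbs=10, aggregate_gbs=10)

# Parallel file system: 40 GB/s per client, 200 GB/s aggregate (illustrative figures).
parallel_fs = load_seconds(WEIGHT_GB, nodes_loading=2, per_client_gbs=40, aggregate_gbs=200)

print(f"single server, 2 nodes loading: {single_server:.2f} s each")
print(f"parallel FS, 2 nodes loading:   {parallel_fs:.2f} s each")
```

The design point is the `aggregate_gbs / nodes_loading` term: on a single server it shrinks with every additional loader, while a parallel file system keeps each client at its own link speed until the aggregate is exhausted.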

What you would change for a different workload

Three plausible variations, and what shifts in the spec sheet:

Variation	Accelerator	Fabric	Cooling
Smaller model (13B-class)	L40S, 1-2 GPUs per node	100 GbE frontend only	High-density air
Shorter context (1k prompt)	H200, fewer GPUs per replica	InfiniBand NDR (same)	RDHX would suffice
Batch / offline workload	A100 80 GB acceptable (no FP8 needed)	Ethernet frontend, no IB	Traditional air

Next step

Open the Sizer pre-filled with these exact parameters. The wizard will run the calc engine and show baseline, burst, and resilient scenarios side by side.
