Reference · worked example 1
Enterprise inferencing: a customer-facing assistant
A large bank stands up an AI assistant for its retail customers. The assistant answers product questions, retrieves account context (with consent), and hands off to a human when uncertain. It runs entirely on-prem because of regulatory constraints on customer data. This is what the deployment looks like, layer by layer.
Workload assumptions
Customer-facing assistant, on-prem, multi-rack
- Daily active users: 5,000+
- Peak concurrent in-flight: 200 requests
- Model: Llama 3.1-70B
- Precision: FP8 (E4M3)
- Average prompt: 4,000 tokens (RAG-heavy)
- Average output: 300 tokens
- TTFT target: < 500 ms p95
- TPOT target: < 40 ms p95
- Redundancy: N+2 (mission-critical)
- Burst factor: 2x
- Deployment: On-prem, multi-rack
- Serving engine: vLLM + Dynamo router
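The assumptions above imply a few derived figures that are useful when reading the rest of the example. The sketch below works them out from the listed values only; the function names are illustrative, and treating every in-flight request as decoding at exactly the TPOT target is a simplifying assumption, not a statement about how the Sizer computes its scenarios.

```python
# Derived figures from the workload assumptions (values copied from the list above).
PEAK_CONCURRENT = 200       # in-flight requests at peak
BURST_FACTOR = 2            # burst multiplier
AVG_PROMPT_TOKENS = 4_000
AVG_OUTPUT_TOKENS = 300
TTFT_TARGET_MS = 500        # time to first token, p95 target
TPOT_TARGET_MS = 40         # time per output token, p95 target


def decode_rate_per_stream(tpot_ms: float) -> float:
    """Minimum tokens/s one stream must sustain to meet the TPOT target."""
    return 1000 / tpot_ms


def aggregate_decode_rate(concurrent: int, tpot_ms: float) -> float:
    """Tokens/s across all streams, assuming every stream decodes at the target."""
    return concurrent * decode_rate_per_stream(tpot_ms)


def tokens_in_flight(concurrent: int, prompt: int, output: int) -> int:
    """Upper bound on tokens held in KV cache if every request is fully generated."""
    return concurrent * (prompt + output)


def response_time_ms(ttft_ms: int, tpot_ms: int, output: int) -> int:
    """End-to-end time for an average response at exactly the target latencies."""
    return ttft_ms + tpot_ms * output


burst_concurrent = PEAK_CONCURRENT * BURST_FACTOR
print("per-stream decode rate (tok/s):", decode_rate_per_stream(TPOT_TARGET_MS))
print("aggregate at peak (tok/s):", aggregate_decode_rate(PEAK_CONCURRENT, TPOT_TARGET_MS))
print("aggregate at burst (tok/s):", aggregate_decode_rate(burst_concurrent, TPOT_TARGET_MS))
print("tokens in flight at peak:",
      tokens_in_flight(PEAK_CONCURRENT, AVG_PROMPT_TOKENS, AVG_OUTPUT_TOKENS))
print("average response time at target (ms):",
      response_time_ms(TTFT_TARGET_MS, TPOT_TARGET_MS, AVG_OUTPUT_TOKENS))
```

At the stated targets, each stream needs at least 25 tokens per second, and an average 300-token answer takes about 12.5 seconds end to end.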
Deployment topology
Six 8-GPU HGX H200 nodes sit behind a pool of two inference-router replicas. An API gateway in front handles rate limiting and auth. The back-end fabric is InfiniBand NDR. Liquid cooling is required at the rack tier.
Figure: six 8-GPU HGX H200 nodes, paired inference routers, an API gateway, and a parallel file system for the model registry.
Why this configuration
Why H200 over H100 or B200
The H200 carries 141 GB of HBM3e per GPU at 4.8 TB/s. That extra memory headroom (vs the H100’s 80 GB) is what lets a 70B model at FP8 fit comfortably on a single GPU and still leave room for the KV cache of long, RAG-heavy prompts. The bandwidth uplift accelerates decode, which is memory-bandwidth-bound. The B200 is the obvious next-generation choice and adds native FP4, but in mid-2026 it is more expensive per GPU and requires a stricter cooling tier for the liquid-only SKU. For a workload at this scale, the H200 is the sweet spot between capability and supply.
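The memory argument can be checked with back-of-the-envelope arithmetic. The sketch below assumes the published Llama 3.1-70B shape (80 layers, 8 KV heads under grouped-query attention, head dimension 128) and, as a labeled assumption, a KV cache stored at 1 byte per element (FP8); the text does not state the KV-cache precision, and activations and runtime overhead are ignored, so the result is an optimistic bound rather than a sizing figure.

```python
# Rough single-GPU memory budget for a 70B model at FP8 on an H200.
HBM_BYTES = 141_000_000_000         # 141 GB of HBM3e per GPU (from the text)
PARAMS = 70_000_000_000             # 70B parameters
WEIGHT_BYTES_PER_PARAM = 1          # FP8 weights

# Model shape for Llama 3.1-70B (published config).
LAYERS = 80
KV_HEADS = 8
HEAD_DIM = 128

# ASSUMPTION: KV cache precision is not given in the text; 1 = FP8, 2 = FP16/BF16.
KV_BYTES_PER_ELEM = 1

TOKENS_PER_REQUEST = 4_000 + 300    # average prompt + average output


def weight_bytes() -> int:
    return PARAMS * WEIGHT_BYTES_PER_PARAM


def kv_bytes_per_token(bytes_per_elem: int) -> int:
    # One key and one value vector per KV head, per layer.
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * bytes_per_elem


def kv_bytes_per_request(tokens: int, bytes_per_elem: int) -> int:
    return tokens * kv_bytes_per_token(bytes_per_elem)


headroom = HBM_BYTES - weight_bytes()
per_request = kv_bytes_per_request(TOKENS_PER_REQUEST, KV_BYTES_PER_ELEM)

print("weights (GB):", weight_bytes() / 1e9)
print("headroom after weights (GB):", headroom / 1e9)
print("KV cache per token (bytes):", kv_bytes_per_token(KV_BYTES_PER_ELEM))
print("KV cache per average request (GB):", per_request / 1e9)
print("average requests that fit in headroom:", headroom // per_request)
```

Setting `KV_BYTES_PER_ELEM = 2` doubles the per-request footprint and halves the number of requests that fit. On an 80 GB H100 the same weights would leave only about 10 GB of headroom.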
Why InfiniBand NDR on the back end
With 6 nodes (48 GPUs) and a router fleet that may shift KV fragments across replicas (Dynamo KV Block Manager pattern), the back-end fabric needs low-latency RDMA at 400 Gb/s per port. InfiniBand NDR with ConnectX-7 is the production-ready answer. RoCE over 400 GbE would work and may be cheaper, but it adds PFC/ECN tuning to the operational burden without any gain in headline bandwidth.
Why N+2 redundancy
Customer-facing means an outage is a public incident. N+2 means we can take one node out for maintenance and still tolerate an unplanned failure during the same window. With six nodes, four sized for peak and two spare, routine patching does not require an off-hours maintenance window.
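The N+2 reasoning reduces to a simple capacity check. The sketch below uses the node counts from this example (six nodes, four needed for peak); the function names are illustrative.

```python
# N+2 capacity check for the six-node deployment described in this example.
TOTAL_NODES = 6
NODES_FOR_PEAK = 4   # replicas sized to carry peak load


def spare_nodes(total: int, required: int) -> int:
    """Number of nodes beyond what peak load needs (the k in N+k)."""
    return total - required


def survives(total: int, required: int, planned_out: int, unplanned_out: int) -> bool:
    """True if peak capacity remains after planned and unplanned outages."""
    return total - planned_out - unplanned_out >= required


print("spare nodes:", spare_nodes(TOTAL_NODES, NODES_FOR_PEAK))
print("one in maintenance, one failed:",
      survives(TOTAL_NODES, NODES_FOR_PEAK, planned_out=1, unplanned_out=1))
print("one in maintenance, two failed:",
      survives(TOTAL_NODES, NODES_FOR_PEAK, planned_out=1, unplanned_out=2))
```

One node in maintenance plus one unplanned failure still leaves four nodes, which is exactly peak capacity. A second unplanned failure in the same window would drop the fleet below peak.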
Why direct liquid cooling
An 8-way H200 chassis pulls roughly 11 kW sustained. Two per rack puts the rack at about 22 kW before networking and ancillaries. Three per rack is comfortably past 30 kW, where high-density air starts to struggle. Direct-to-chip liquid (warm-water DLC at 30-45 °C supply) handles this density without drama and enables free cooling in most climates. RDHX would be acceptable for the two-chassis tier but loses headroom if the workload grows.
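The rack-density arithmetic is short enough to write down. The sketch below uses the roughly 11 kW per chassis and the 30 kW air threshold quoted above; it counts chassis load only, so networking and ancillaries would come on top.

```python
import math

# Rack power density for the six-chassis deployment (chassis load only).
CHASSIS_KW = 11.0      # sustained draw per 8-way H200 chassis (from the text)
AIR_LIMIT_KW = 30.0    # density where the text says high-density air starts to struggle
TOTAL_CHASSIS = 6


def rack_load_kw(chassis_per_rack: int) -> float:
    return chassis_per_rack * CHASSIS_KW


def racks_needed(total_chassis: int, chassis_per_rack: int) -> int:
    return math.ceil(total_chassis / chassis_per_rack)


def exceeds_air_limit(chassis_per_rack: int) -> bool:
    return rack_load_kw(chassis_per_rack) > AIR_LIMIT_KW


for per_rack in (2, 3):
    print(
        f"{per_rack} per rack: {rack_load_kw(per_rack)} kW, "
        f"{racks_needed(TOTAL_CHASSIS, per_rack)} racks, "
        f"past air limit: {exceeds_air_limit(per_rack)}"
    )
```

Two chassis per rack needs three racks at 22 kW each; three per rack needs two racks at 33 kW each, which is past the air threshold.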
Why a parallel file system
When a node fails and a replacement spins up, the new node needs to load 70 GB of FP8 weights as fast as possible. A parallel file system (WEKA, VAST, or equivalent) lets several replicas pull weights in parallel without serializing on a single NFS box. Local NVMe inside each node serves as the KV-cache spill tier (Dynamo KV Block Manager).
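To see why serializing on one file server hurts, compare load times under two simple models. Every bandwidth figure below is a hypothetical placeholder chosen for illustration: the text does not specify storage throughput, and whether the storage path can actually feed a node at NDR line rate is an assumption. Only the 70 GB weight size and the 400 Gb/s port speed come from this example.

```python
# Weight-load time: one shared server versus parallel readers.
WEIGHTS_GB = 70.0   # FP8 weights for the 70B model (from the text)


def gbit_to_gbyte(gbit_per_s: float) -> float:
    """Convert a link rate in Gb/s to GB/s."""
    return gbit_per_s / 8


def serialized_load_s(readers: int, size_gb: float, server_gb_per_s: float) -> float:
    """Time until the last reader finishes when all data flows through one server."""
    return readers * size_gb / server_gb_per_s


def parallel_load_s(size_gb: float, per_reader_gb_per_s: float) -> float:
    """Time when every reader gets its own share of bandwidth concurrently."""
    return size_gb / per_reader_gb_per_s


# HYPOTHETICAL: a single NFS server sustaining 5 GB/s in total.
nfs_total = 5.0
# ASSUMPTION: the parallel file system can feed each node at one NDR port's line rate.
per_node = gbit_to_gbyte(400)

print("NDR port rate (GB/s):", per_node)
print("two replicas via one NFS box (s):", serialized_load_s(2, WEIGHTS_GB, nfs_total))
print("each replica via parallel FS, line-rate lower bound (s):",
      parallel_load_s(WEIGHTS_GB, per_node))
```

The line-rate figure is a lower bound, not a prediction; real load times also depend on the file system, client configuration, and how the serving engine stages weights into GPU memory.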
What you would change for a different workload
Three plausible variations, and what shifts in the spec sheet:
| Variation | Accelerator | Fabric | Cooling |
|---|---|---|---|
| Smaller model (13B-class) | L40S, 1-2 GPUs per node | 100 GbE frontend only | High-density air |
| Shorter context (1k prompt) | H200, fewer GPUs per replica | InfiniBand NDR (same) | RDHX would suffice |
| Batch / offline workload | A100 80 GB acceptable (no FP8 needed) | Ethernet frontend, no IB | Traditional air |
Next step
Open the Sizer pre-filled with these exact parameters. The wizard will run the calc engine and show baseline, burst, and resilient scenarios side by side.