Reference · worked example 2

Departmental inferencing: an internal help assistant

A 2,000-person business unit stands up an internal assistant that answers HR and IT questions, looks up policy documents, and opens tickets. The audience is internal, traffic is modest, and downtime degrades gracefully to a static help page. This is what a one-rack deployment looks like.

Workload assumptions

Internal HR/IT help assistant, single rack

Active users: 200
Peak concurrent in-flight: 10-20 requests
Model: 13B-class (e.g. Qwen3 14B)
Precision: FP8 (E4M3)
Average prompt: 1,500 tokens
Average output: 250 tokens
TTFT target: < 1,000 ms (relaxed)
TPOT target: < 50 ms
Redundancy: N+1
Burst factor: 1.5x
Deployment: Single rack, single GPU per node
Serving engine: vLLM
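
The targets above imply a rough aggregate decode rate. The following is a back-of-envelope sketch, not output from the Sizer. It assumes the burst factor applies to the top of the 10-20 in-flight range and that every request decodes at exactly the TPOT target; neither assumption is stated in the table.

```python
# Back-of-envelope load figures derived from the workload table.
# Assumption: burst factor is applied to the upper end of the in-flight range.
PEAK_IN_FLIGHT = 20
BURST_FACTOR = 1.5
TPOT_MS = 50              # target time per output token
AVG_PROMPT_TOKENS = 1_500
AVG_OUTPUT_TOKENS = 250

burst_in_flight = PEAK_IN_FLIGHT * BURST_FACTOR
tokens_per_s_per_request = 1_000 / TPOT_MS
aggregate_decode_tps = burst_in_flight * tokens_per_s_per_request
avg_context_tokens = AVG_PROMPT_TOKENS + AVG_OUTPUT_TOKENS
# Decode time for an average answer if TPOT sits right at the target.
decode_seconds = AVG_OUTPUT_TOKENS * TPOT_MS / 1_000

print(f"burst in-flight requests: {burst_in_flight:.0f}")
print(f"aggregate decode rate:    {aggregate_decode_tps:.0f} tokens/s")
print(f"average context length:   {avg_context_tokens} tokens")
print(f"decode time per answer:   {decode_seconds:.1f} s")
```

The aggregate figure is spread across the active replicas, so the per-GPU requirement is lower than the total.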

Deployment topology

Three single-GPU L40S servers behind one vLLM instance acting as the router. Standard air cooling. No back-end fabric, no parallel file system, no Dynamo. The whole deployment fits in one cabinet.

Three L40S servers, one vLLM front, local NVMe per node.

Why this configuration

Why L40S over H100 or H200

A 13B model at FP8 occupies about 13 GB, at one byte per parameter. With a 4k context window and modest concurrency, weights plus KV cache stay well under the L40S 48 GB envelope. The L40S also fits a standard 2U PCIe chassis, needs no NVLink baseboard or liquid loop, and draws 350 W per GPU. An H100 or H200 would also work and would leave spare headroom for growth, but at roughly 3-4x the per-GPU cost and with a TDP that pushes the rack power budget. For a 13B model at this scale, the L40S is sufficient.
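
The memory claim can be checked with a rough budget. The sketch below is illustrative only: the layer count, KV-head count, head dimension, and FP16 KV cache are assumed values for a 13B-class model with grouped-query attention, not figures from this example. It takes the worst case in which the full burst (20 in flight x 1.5) lands on one replica and every request fills the 4k window.

```python
# Assumed architecture for a 13B-class model with grouped-query attention.
# These values are illustrative, not taken from the worked example.
NUM_LAYERS = 40
NUM_KV_HEADS = 8
HEAD_DIM = 128
KV_BYTES_PER_VALUE = 2    # FP16 KV cache (assumption); an FP8 cache would halve it

WEIGHT_GB = 13            # 13B parameters at one byte each (FP8)
GPU_MEMORY_GB = 48        # L40S
CONTEXT_TOKENS = 4_096    # "4k context window"
IN_FLIGHT = 30            # worst case: whole burst on a single replica

# One key tensor and one value tensor per layer, hence the factor of 2.
kv_bytes_per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * KV_BYTES_PER_VALUE
kv_cache_gb = kv_bytes_per_token * CONTEXT_TOKENS * IN_FLIGHT / 1e9
headroom_gb = GPU_MEMORY_GB - WEIGHT_GB - kv_cache_gb

print(f"KV cache per token: {kv_bytes_per_token / 1024:.0f} KiB")
print(f"KV cache total:     {kv_cache_gb:.1f} GB")
print(f"headroom:           {headroom_gb:.1f} GB")
```

Activations and runtime overhead are not counted here, and a model without grouped-query attention uses proportionally more KV memory per token, so the numbers are worth rerunning against the real model config.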

Why 100 GbE only

Each L40S server is a self-contained inference replica. There is no cross-node tensor parallelism, so there is no back-end fabric to build. The 100 GbE frontend carries user requests and model load traffic, and that is sufficient.
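
To get a feel for the model load traffic, here is a small sketch of how long it takes to pull roughly 13 GB of weights over a 100 GbE link. The 70% link efficiency is an assumed figure for protocol and storage overhead, and `transfer_seconds` is a hypothetical helper, not part of any tool named here.

```python
WEIGHT_GB = 13       # FP8 weights for a 13B model
LINK_GBPS = 100      # 100 GbE frontend

def transfer_seconds(size_gb: float, link_gbps: float, efficiency: float = 1.0) -> float:
    """Seconds to move size_gb gigabytes over a link of link_gbps gigabits/s."""
    return size_gb * 8 / (link_gbps * efficiency)

line_rate_s = transfer_seconds(WEIGHT_GB, LINK_GBPS)
realistic_s = transfer_seconds(WEIGHT_GB, LINK_GBPS, efficiency=0.7)  # assumed efficiency

print(f"at line rate:      {line_rate_s:.2f} s")
print(f"at 70% efficiency: {realistic_s:.2f} s")
```

Either way the load completes in a couple of seconds per replica, which supports treating the frontend network as sufficient for this workload.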

Why N+1 instead of N+2

The workload is internal and the failure mode is bounded (employees fall back to existing help portals). One spare replica gives us a maintenance window without taking the service offline, which is the operational threshold that matters here. N+2 would be over-spec for the consequence of an outage.
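
The replica count follows from the same arithmetic. In the sketch below, the per-replica capacity of 15 in-flight requests is an assumption chosen to be consistent with the three-server topology; the source does not state a measured figure, and `replicas_needed` is a hypothetical helper.

```python
import math

def replicas_needed(peak_in_flight: int, burst_factor: float,
                    per_replica_capacity: int, spares: int) -> int:
    """Replicas required to serve the burst load, plus spare replicas."""
    n = math.ceil(peak_in_flight * burst_factor / per_replica_capacity)
    return n + spares

PER_REPLICA_CAPACITY = 15   # assumed in-flight requests one L40S replica can hold

n_plus_1 = replicas_needed(20, 1.5, PER_REPLICA_CAPACITY, spares=1)
n_plus_2 = replicas_needed(20, 1.5, PER_REPLICA_CAPACITY, spares=2)

print(f"N+1: {n_plus_1} replicas")
print(f"N+2: {n_plus_2} replicas")
```

Under this assumption, N+1 still covers the full burst with one server out for maintenance, while N+2 adds a fourth server that only matters if two replicas are down at once.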

Why standard air cooling

Three L40S servers at 350 W per GPU, plus chassis overhead, land well under 5 kW per rack. That sits comfortably inside traditional air-cooled cabinets, so no facility-side changes are required. The deployment can land in an existing colocation row without a cooling retrofit.
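
The power estimate works out as follows. The 450 W per-server chassis overhead (CPU, memory, fans, NVMe, NIC) is an assumed figure for illustration, and top-of-rack switching is left out.

```python
GPU_TDP_W = 350            # L40S TDP
SERVERS = 3
CHASSIS_OVERHEAD_W = 450   # assumed per-server overhead, not from the source
RACK_BUDGET_W = 5_000      # air-cooled threshold used in this example

rack_w = SERVERS * (GPU_TDP_W + CHASSIS_OVERHEAD_W)
margin_w = RACK_BUDGET_W - rack_w

print(f"estimated rack draw: {rack_w} W")
print(f"margin under 5 kW:   {margin_w} W")
```

Even if the real chassis overhead were double the assumed value, the total would still stay below the 5 kW threshold.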

What you would change for a different workload

Three plausible variations, and what shifts in the spec sheet:

Variation	Accelerator	Topology	Cooling
Bigger model (30B-class)	L40S with INT4 quant, or single H100 PCIe	Same single-GPU pattern	High-density air for H100
Higher concurrency (200 in flight)	L40S, more replicas	Add a small router (Dynamo or vLLM scheduler)	Likely still high-density air
Embeddings-only workload	L4 (72 W TDP)	1U edge server, 4-8 GPUs	Traditional air, any rack

Next step

Open the Sizer pre-filled with these exact parameters. Compare what the calc engine produces against this hand-crafted topology.

Size this workload in the Sizer