Calculator · Internal
Capex, opex, throughput, margin, and payback for a GPU inference fleet, parametrized by model size, utilization, electricity price, and facility cost. Tweak inputs on the left to feel the curve.
20 MW · 19,040 × H100 80GB · Llama 4 Maverick 400B/17B MoE (INT4) · 6,346 instances (3 GPUs each) · batch 64
Batch curve. At batch=1 the instance is memory-bound, with a bandwidth ceiling of 1,182 tok/s/instance before overhead. The compute-bound ceiling is 174,618 tok/s/instance. The transition (b*) sits at batch ≈ 148. You're running at batch=64, which delivers 9,144 tok/s/instance after real-world overhead (efficiency factor η ≈ 0.3 vs. theoretical, calibrated to vLLM benchmarks, plus the 1/√N sharding factor for N = 3 GPUs).
Electricity is 2% of costs. GPU depreciation dominates. The electricity arbitrage helps but isn't the main driver.
Cost per million tokens: $0.1012. Selling at $0.210/M. That's a 108% markup.
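Markup and gross margin are easy to confuse, so here is a minimal sketch of both, using only the cost and price quoted above. The function name is hypothetical and not part of the calculator.

```python
def markup_and_margin(cost_per_m: float, price_per_m: float) -> tuple[float, float]:
    """Return (markup, gross margin) as fractions.

    markup = profit relative to cost; margin = profit relative to price.
    """
    markup = price_per_m / cost_per_m - 1.0
    margin = 1.0 - cost_per_m / price_per_m
    return markup, margin


# Figures quoted on this page: $0.1012 cost and $0.210 price per million tokens.
markup, margin = markup_and_margin(0.1012, 0.210)
print(f"markup {markup:.0%}, gross margin {margin:.0%}")
# prints "markup 108%, gross margin 52%"
```

The same spread reads as a 108% markup on cost or roughly a 52% gross margin on price.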
Capex comparison. A 20 MW fleet costs $442.72M vs $500.00M at hyperscaler rates ($25M/MW). That's $57.28M in savings, or 11% cheaper.
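The comparison above is simple arithmetic. A minimal sketch, taking the $442.72M fleet figure as given rather than rebuilding it from components (the function name is hypothetical):

```python
def capex_savings(fleet_capex_m: float, mw: float, benchmark_m_per_mw: float):
    """Compare fleet capex ($M) with a per-MW benchmark rate ($M/MW)."""
    benchmark_m = mw * benchmark_m_per_mw      # hyperscaler-rate cost, $M
    savings_m = benchmark_m - fleet_capex_m    # absolute savings, $M
    return benchmark_m, savings_m, savings_m / benchmark_m


benchmark_m, savings_m, savings_frac = capex_savings(442.72, 20, 25.0)
print(f"benchmark ${benchmark_m:.2f}M, savings ${savings_m:.2f}M ({savings_frac:.0%})")
```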
Throughput. 1372.5T tokens/year at 75% duty cycle across 6,346 instances of Llama 4 Maverick 400B/17B MoE (INT4).
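The annual figure can be approximately reproduced from the per-instance rate quoted in the batch-curve note (9,144 tok/s at batch=64). This sketch assumes a 365-day year, which the page does not state; the names are hypothetical.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # assumption: 365-day year


def annual_tokens(tok_per_s_per_instance: float, instances: int, duty_cycle: float) -> float:
    """Fleet-wide tokens per year at a given duty cycle."""
    return tok_per_s_per_instance * instances * duty_cycle * SECONDS_PER_YEAR


tokens = annual_tokens(9144, 6346, 0.75)
print(f"{tokens / 1e12:.1f}T tokens/year")
```

The small gap to the displayed 1372.5T before rounding comes from using the rounded 9,144 tok/s as input.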
Throughput model: per-instance aggregate tok/s = T_compute · b/(b+b*) · η · 1/√N, where:
T_compute = (GPU_TFLOPS · N) / (2 · active_params)
T_memory = (GPU_BW · N) / (active_params · 0.5 bytes)
b* = T_compute / T_memory
N = ceil(model_vram / gpu_vram)
η = 0.3 (real-world overhead: KV reads, sampling, dequant, framework; calibrated against vLLM Llama-70B INT4 H100 ≈ 2,200 tok/s @ b=128)
Assumptions: continuous batching, INT4 quantization, and tensor-parallel sharding when N>1. 80% of IT power goes to GPUs. Facility depreciated over 10 years. Excludes financing, tax, and land. Real production batch sizes are KV-cache-limited; observed effective batch is typically 32–128 even with max_num_seqs set higher.
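A minimal Python sketch of the throughput model as written above. The hardware inputs are assumptions, not values stated on this page: 1,979 TFLOPS and 3.35 TB/s per H100 were chosen because they reproduce the quoted batch-curve numbers. model_vram is assumed to be weights only (total_params × 0.5 bytes). All function and field names are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Throughput:
    n_gpus: int       # N, GPUs per instance
    t_compute: float  # compute-bound ceiling, tok/s
    t_memory: float   # memory-bandwidth bound at batch=1, tok/s
    b_star: float     # transition batch size b*
    tok_s: float      # modeled aggregate tok/s per instance


def throughput(total_params: float, active_params: float, batch: int,
               gpu_flops: float = 1979e12,   # assumed H100 FLOP/s (not stated on the page)
               gpu_bw: float = 3.35e12,      # assumed H100 memory bandwidth, bytes/s
               gpu_vram: float = 80e9,       # H100 80GB
               bytes_per_param: float = 0.5, # INT4
               eta: float = 0.3) -> Throughput:
    # Assumption: model_vram counts weights only, no KV cache.
    n = math.ceil(total_params * bytes_per_param / gpu_vram)
    t_compute = gpu_flops * n / (2 * active_params)
    t_memory = gpu_bw * n / (active_params * bytes_per_param)
    b_star = t_compute / t_memory
    tok_s = t_compute * batch / (batch + b_star) * eta / math.sqrt(n)
    return Throughput(n, t_compute, t_memory, b_star, tok_s)


# Llama 4 Maverick 400B total / 17B active, INT4, batch 64
r = throughput(total_params=400e9, active_params=17e9, batch=64)
print(r.n_gpus, round(r.t_memory), round(r.t_compute), round(r.b_star), round(r.tok_s))
```

Under these assumed inputs the sketch gives N = 3, b* ≈ 148, and about 9,144 tok/s per instance, in line with the figures in the batch-curve note.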
Facility capex benchmarks
Four realistic build profiles, $0.5M/MW (crypto shed) to $30M/MW (hyperscale AI). Each tier shows the cost band and 3–4 cited comps so you can pick a number that matches the actual deployment you're modeling, not a hand-waved average.
Industrial sheds built for ASICs. Single grid feed, no UPS, evaporative or pure outdoor air cooling. Not suitable for sustained AI inference without retrofit, but the land, shell, and interconnect are already paid for and can be reused.
Lower-tier purpose-built, single grid feed, CRAH-based air cooling at 25–35 kW/rack, basic gen backup, modest UPS. Inference SLAs (99–99.9%) tolerate single-feed risk. This is the realistic tier for an air-cooled A100 / H100 deployment in West Texas.
Conventional colo / wholesale build. N+1 to 2N power redundancy, full UPS, water-cooled chillers, certified to Uptime Tier III. The default reference build for an Equinix / Digital Realty / CyrusOne wholesale deal.
Direct-to-chip liquid cooling, 80–130 kW/rack density, dedicated substation, 2N power, multi-100 MW campus. The build profile for a Crusoe / Microsoft / Meta AI factory designed to host GB200 NVL72 racks.
Numbers are facility capex only — building shell, MEP, power infrastructure, cooling, fire/security, gen backup. GPU hardware, interconnect upgrades, and land are separate. The hyperscale band excludes GPUs (GB200 racks add ~$3M/rack on top, financed separately). For an air-cooled A100 inference build in West Texas, the $3–6M/MW band is the right starting point; drop to $1.5–3M/MW if you're acquiring a converted crypto shell with usable power delivery already in place.
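To turn a $/MW band into a facility budget, multiply by site capacity. The sketch below applies the two bands mentioned above to a 20 MW site (the calculator's example size) and spreads the result over the 10-year facility life. Straight-line depreciation is an assumption here, and the function name is hypothetical.

```python
def facility_capex_range(mw: float, low_m_per_mw: float, high_m_per_mw: float,
                         life_years: int = 10):
    """Return ((low, high) capex in $M, (low, high) annual depreciation in $M/yr)."""
    low, high = mw * low_m_per_mw, mw * high_m_per_mw
    # Assumption: straight-line depreciation over the facility life.
    return (low, high), (low / life_years, high / life_years)


purpose_built = facility_capex_range(20, 3.0, 6.0)   # $3-6M/MW band
converted_shell = facility_capex_range(20, 1.5, 3.0) # $1.5-3M/MW band
print(purpose_built)    # ((60.0, 120.0), (6.0, 12.0))
print(converted_shell)  # ((30.0, 60.0), (3.0, 6.0))
```

These are facility-only figures; GPU hardware, interconnect upgrades, and land come on top.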