Research note · 2026-05-18
The hook is real on the surface, in two directions. MI300X rents for $1.50/GPU-hr at TensorWave while H100 sits at $2.99 on Lambda — literally half. And AMD chips are clickable on-demand today at 5+ neoclouds, while NVIDIA H100 is 36–52 weeks direct purchase and B200 is allocated through H2 2027 to Microsoft / Google / Meta / Amazon. But load in three years of power, cooling, and ops and MI300X ends up 3% more expensive than H100 at ownership — the rental discount is mostly neocloud silicon depreciation, not a per-node TCO advantage. Where AMD actually wins is HBM capacity per dollar: MI300X delivers HBM at $498/GB vs H100's $1,162/GB. That 2.3× edge is what makes Llama 3.1 405B fit on a single node instead of two. This page walks through what's real, what's marketing, and where the workload line actually falls.
The surface claim
MI300X rents for $1.50/GPU-hr at TensorWave, $1.71 on Crusoe spot, $1.85 at Vultr. H100 SXM rents for $2.99 at Lambda and $3.49 at RunPod. The discount lives at five providers — TensorWave, Hot Aisle, Vultr, Crusoe spot, AMD Dev Cloud — and disappears at Azure ($6/hr) and Oracle ($6/hr) where MI300X is listed at H200 prices.
SemiAnalysis InferenceMAX finds MI300X needs to rent under $1.90/GPU-hr to win $/M tokens on Llama 3 70B chat against H200 + TensorRT-LLM, and MI325X under $2.50/GPU-hr. TensorWave ($1.50), Crusoe spot ($1.71) and Vultr ($1.85) clear the MI300X bar; TensorWave MI325X at $1.95 clears the MI325X bar. Median MI300X is $1.99 vs H100 median $3.49 (43% lower) — but median is pulled up by Azure / Oracle at $6/hr, where the AMD discount has been competed away.
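The break-even test above is a simple threshold filter. A minimal sketch, using only the MI300X prices and the $1.90 bar quoted in this note (the function names are illustrative, and Hot Aisle and AMD Dev Cloud are left out because no price is quoted for them here):

```python
# MI300X on-demand prices quoted in this note, in $/GPU-hr.
MI300X_PRICES = {
    "TensorWave": 1.50,
    "Crusoe spot": 1.71,
    "Vultr": 1.85,
    "Azure": 6.00,
    "Oracle": 6.00,
}

# Break-even rental price cited from SemiAnalysis InferenceMAX for
# Llama 3 70B chat against H200 + TensorRT-LLM.
MI300X_BREAK_EVEN = 1.90


def providers_under(prices, threshold):
    """Return provider names priced strictly below the threshold, sorted."""
    return sorted(name for name, price in prices.items() if price < threshold)


def discount_vs(price, reference):
    """Fractional discount of `price` relative to `reference`."""
    return 1.0 - price / reference


if __name__ == "__main__":
    print(providers_under(MI300X_PRICES, MI300X_BREAK_EVEN))
    print(f"median discount: {discount_vs(1.99, 3.49):.0%}")
```

Swapping in the $2.50 MI325X bar and MI325X prices gives the same check for that part.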
The other surface claim
The rental discount is one half of the story. The other half — bigger, structurally — is supply. NVIDIA H100 is 36–52 weeks direct purchase, H200 reserved pools are sold out, and B200 is allocated through H2 2027 to Microsoft / Google / Meta / Amazon. AMD MI300X / MI325X / MI355X are clickable on-demand today at 5+ neoclouds. Same TSMC CoWoS bottleneck; different allocation policy.
Both AMD and NVIDIA run through the same TSMC CoWoS packaging bottleneck for HBM-stacked accelerators — capacity is fully allocated through mid-2027 (Spheron). What's different is allocation policy: Microsoft, Google, Meta, and Amazon placed multi-billion-dollar forward orders for Blackwell in 2025 and consumed most of NVIDIA's 2026-27 supply (Lyceum). AMD has no equivalent pre-commitment regime — Meta's 6 GW MI450 deal lands H2 2026, but MI300X / MI325X are not blocked by it. That's the structural reason a startup can swipe a card for 8 MI300X at Crusoe today but waits 9+ months for an H200.
Sources: lead times via Spheron GPU shortage 2026 brief; MI300X availability at Crusoe; MI325X at TensorWave; MI355X via Phoronix and Tom's Hardware on TensorWave's 8,192-GPU cluster. NVIDIA allocation reporting via Lyceum and Spheron, summarizing SemiAnalysis / Bloomberg coverage.
What you're missing — part 1
Silicon capex is one column of the bar. Load in three years of power, the cooling stack the rack actually requires, and a flat $150K/node-yr ops envelope, and the five configurations land inside a $137K spread, about 17% of the $806K average. MI300X comes out 3% more expensive than H100 at ownership. The AMD pitch is not lower TCO. It's much more HBM per TCO dollar.
3-yr window, 85% duty cycle, $0.07/kWh blended industrial. Silicon capex from channel listings (Supermicro AS-8125GS-TNMR2 for AMD; HGX list for NVIDIA). Cooling delta scales $50/GPU/yr per 100W over a 700W baseline. Ops fixed at $150K/node/yr. TCO totals: MI300X $765K, H100 $744K, H200 $789K, MI355X $851K, B200 $881K — $137K spread across all five.
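The headline ratios can be rechecked from the stated totals alone. A rough sketch, assuming 8 GPUs per node and per-GPU HBM of 192 GB (MI300X) and 80 GB (H100); the node size, the H100 capacity, and the helper names are assumptions for illustration, not outputs of the TCO model itself:

```python
# Three-year TCO per node as stated in this note, in dollars.
TCO_3YR = {
    "MI300X": 765_000,
    "H100": 744_000,
    "H200": 789_000,
    "MI355X": 851_000,
    "B200": 881_000,
}

# Assumed inputs: 8 GPUs per node, vendor-datasheet HBM per GPU in GB.
GPUS_PER_NODE = 8
HBM_GB_PER_GPU = {"MI300X": 192, "H100": 80}


def spread(values):
    """Difference between the most and least expensive configuration."""
    values = list(values)
    return max(values) - min(values)


def spread_share_of_mean(values):
    """Spread expressed as a fraction of the arithmetic mean."""
    values = list(values)
    return spread(values) / (sum(values) / len(values))


def tco_per_hbm_gb(chip):
    """Three-year node TCO divided by total node HBM capacity."""
    return TCO_3YR[chip] / (HBM_GB_PER_GPU[chip] * GPUS_PER_NODE)


if __name__ == "__main__":
    print(spread(TCO_3YR.values()), f"{spread_share_of_mean(TCO_3YR.values()):.1%}")
    for chip in ("MI300X", "H100"):
        print(chip, f"${tco_per_hbm_gb(chip):,.0f}/GB")
```

The per-GB figures come out near $498 and $1,162, which is where the 2.3× HBM-per-TCO-dollar edge comes from.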
What you're missing — part 2
The 1/2 capex framing flattens the wrong axis. The right axis is HBM capacity per dollar of rental, where MI300X / MI325X / MI355X each sit ahead of every NVIDIA part on the page. That capacity is what makes Llama 3.1 405B fit on a single node, not two — and it's what drives the off-grid economics deeper in the page.
Silicon list price against HBM capacity. Up and to the left is better. AMD's three Instinct parts all sit above every NVIDIA isoprice contour — MI300X delivers HBM at $94/GB of silicon list, compared to $375/GB for H100. AMD mean is $106/GB vs NVIDIA mean $291/GB — a 64% edge that compounds at every level above silicon.
Silicon unit price is channel / street, May 2026: MI300X $18K, MI325X $22K, MI355X $40K, H100 $30K, H200 $35K, B200 $45K. Dashed contours show equal-$/HBM-GB curves at the silicon level. AMD's three parts all sit above the $150/GB contour; NVIDIA's sit at $250+.
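The $/GB figures follow directly from unit price divided by HBM capacity. A small sketch under stated assumptions: prices are the channel figures above, while the HBM capacities are vendor-datasheet values not listed on this page (B200 is taken as 180 GB, which is the value that reproduces the $291/GB NVIDIA mean):

```python
# Channel / street silicon prices from this note, in dollars.
SILICON_PRICE_USD = {
    "MI300X": 18_000, "MI325X": 22_000, "MI355X": 40_000,
    "H100": 30_000, "H200": 35_000, "B200": 45_000,
}

# Assumed per-GPU HBM capacity in GB (datasheet values; B200 assumed 180 GB).
HBM_GB = {
    "MI300X": 192, "MI325X": 256, "MI355X": 288,
    "H100": 80, "H200": 141, "B200": 180,
}

AMD = ("MI300X", "MI325X", "MI355X")
NVIDIA = ("H100", "H200", "B200")


def usd_per_gb(chip):
    """Silicon unit price per GB of HBM."""
    return SILICON_PRICE_USD[chip] / HBM_GB[chip]


def mean_usd_per_gb(chips):
    """Unweighted mean of $/GB across a group of chips."""
    return sum(usd_per_gb(c) for c in chips) / len(chips)


if __name__ == "__main__":
    for chip in AMD + NVIDIA:
        print(f"{chip}: ${usd_per_gb(chip):,.0f}/GB")
    edge = 1 - mean_usd_per_gb(AMD) / mean_usd_per_gb(NVIDIA)
    print(f"AMD mean edge: {edge:.0%}")
```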
May 2026 datasheet head-to-head. MI300X has 2.4× the HBM of an H100 and 58% more bandwidth, but trails on FP8 compute by ~32%. MI355X is the first AMD part that closes the compute gap on paper: 2.2× B200 FP8 and FP4 (peak), at +40% TDP.
Sources: AMD product pages for MI300X, MI325X, MI355X; NVIDIA H100 / H200 / B200 datasheets; channel pricing via IntuitionLabs and Silicon Analysts. Dense numbers, no sparsity.
MI300X shipped in late 2023. By late 2025, AMD had silicon (MI300X → MI325X → MI355X), software (ROCm 7 + AITER), and marquee buyers (Meta serving Llama 3.1 405B, OpenAI's 6 GW MI450 deal). Three tracks, one trajectory.
AMD wins when HBM capacity or bandwidth is the binding constraint (large dense models, single-stream latency-bound chat). NVIDIA wins when software composability binds (MoE expert-parallel, disaggregated prefill/decode, FP4 with TensorRT-LLM). MI355X is the first AMD part that closes the compute gap on paper — early benchmarks land within 10% of B200 on gpt-oss-class workloads.
Dense, capacity-binding: 1×8 MI300X fits the model; ≥$/Mtok competitive when rented <$1.99/GPU-hr. H100 needs TP=16 (2 nodes); H200 needs TP=8 but tighter HBM.
Memory-bound, single stream: MI300X wins $/Mtok if priced under $1.90/GPU-hr (now true on TensorWave / Hot Aisle / Vultr). H200 + TRT-LLM catches up beyond ~60s latency budget.
MoE, expert-parallel: Wide-EP + disaggregated prefill/decode on B200 + Dynamo wins across most points. AMD is >6 months behind on open-source distributed inferencing.
Dense, FP4 native: MI355X reaches 2.55M tok/s/MW per SemiAnalysis at matched interactivity. B200 + Dynamo TRT-LLM reaches 2.8M tok/s/MW, a ~10% edge.
Compute-bound, kernel-sensitive: TRT-LLM + FP8 kernels close the per-GPU gap; ecosystem advantage dominates. AMD HBM advantage doesn't bind; software tax > capex savings.
Mixed batching: MI300X ≈ H100; MI325X ≈ H200 at submitter-tuned configs. B200 ≈ 3× H200; Blackwell is the headline winner that round.
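The single-node versus two-node claim for the dense, capacity-binding case is a weights-versus-HBM check. A minimal sketch, assuming the ~810 GB FP16 weight figure this note uses for Llama 3.1 405B, 8 GPUs per node, and a 141 GB H200 (a datasheet value not stated on this page). It counts weights only, so KV cache and activation memory would tighten every result:

```python
import math

WEIGHTS_GB = 810      # Llama 3.1 405B FP16 weights, as used in this note
GPUS_PER_NODE = 8     # assumed node size

# Per-GPU HBM in GB; the H200 figure is an assumed datasheet value.
HBM_GB = {"MI300X": 192, "H100": 80, "H200": 141}


def nodes_needed(chip, weights_gb=WEIGHTS_GB):
    """Smallest number of 8-GPU nodes whose pooled HBM holds the weights."""
    node_hbm = HBM_GB[chip] * GPUS_PER_NODE
    return math.ceil(weights_gb / node_hbm)


def headroom_gb(chip, weights_gb=WEIGHTS_GB):
    """HBM left over for KV cache and activations after loading weights."""
    pooled = nodes_needed(chip, weights_gb) * HBM_GB[chip] * GPUS_PER_NODE
    return pooled - weights_gb


if __name__ == "__main__":
    for chip in HBM_GB:
        n = nodes_needed(chip)
        print(f"{chip}: {n} node(s), TP={n * GPUS_PER_NODE}, "
              f"{headroom_gb(chip)} GB headroom")
```

On these inputs MI300X and H200 both fit on one node, but H200 is left with well under half the headroom, which is the "tighter HBM" caveat in the row above.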
When the workload is HBM-binding and software composability is simple, AMD wins. When composability binds — MoE expert-parallel, disaggregated prefill / decode, FP4 with Dynamo — NVIDIA wins. The two shaded regions are the structural shape of AMD's 2026 opportunity.
The AMD bull case sits in the top-left: dense large-context models, simple serving, single-node deploys. That's exactly where Meta chose MI300X for Llama 3.1 405B and where Oracle's benchmark beat H100. The NVIDIA bull case is everywhere with software intensity — MoE serving, disaggregated prefill / decode, FP4 + Dynamo. If your workload looks like the top-left, the 1/2 rental discount carries through to $/M tokens. If it looks like the right column, the software tax exceeds the per-hour discount.
ROCm software state
ROCm 7.2.3 + AITER attention backend + vLLM upstream support is the May 2026 production stack. Meta and Azure-OpenAI proof-points show the gap is now narrow enough to matter only at the edges.
Five subsystems matter for production inference. vLLM and FlashAttention are essentially at parity; AITER is the fast-closing TensorRT-LLM equivalent; disaggregated prefill/decode is still the biggest open gap.
AITER backend lifts AMD perf 1.2–4.4× over legacy; H200 still leads ~12% in most shapes
Day-1 DeepSeek V3/R1 on AMD; CI parity still <10% of NVIDIA, ~25% of models fail accuracy
Monthly kernel releases; DeepSeek-V3 ~2× w/ AITER; not as turnkey as TRT-LLM
Official MI200/MI300 support via CK + Triton backends, head dim up to 256, FP8 in Triton
AMD reference architecture exists for SGLang+MI300X but >6 months behind NVIDIA Dynamo
ROCm 7. GA September 2025; ROCm 7.2.3 is the May 2026 production target. AMD's own benchmarks claim 3.5× inference uplift vs ROCm 6, with first-class MI350X / MI355X support, PyTorch 2.7, Triton 3.3 (AMD ROCm 7 launch).
vLLM is the de facto runtime. Upstream vLLM ships a prebuilt ROCm Docker image; the February 2026 AITER attention backend delivers 1.2–4.4× throughput over legacy ROCM_ATTN (vLLM blog). SGLang has day-one DeepSeek V3/R1 support on MI300X / MI325X / MI355X with FP8 / MXFP4 / AWQ. Disaggregated prefill/decode is now an official AMD reference architecture (AMD ROCm blog).
The honest gap. SemiAnalysis flags that AMD is still >6 months behind on open-source distributed inferencing — wide-EP plus FP4 plus disaggregation on B200 + Dynamo wins MoE serving today. SGLang+ROCm CI coverage is <10% of NVIDIA's, and ~25% of tested models still fail accuracy checks on the AMD backend (SemiAnalysis inference deep-dive). The pattern: dense large-context wins, MoE-disagg loses, sub-70B loses on software composability.
Off-grid implications
An off-grid AI campus has a fixed MW envelope. TDP-per-GPU walks straight through to HBM-per-MW and to the cooling stack — and to the facility capex band you have to build.
An off-grid AI campus has a fixed MW envelope set by generation + BESS. The TDP-per-GPU walks straight through to GPUs-per-MW and HBM-per-MW. AMD's MI300X is an air-cooling-friendly drop-in at 750W (similar to H100's 700W). MI355X at 1400W needs full direct-to-chip liquid cooling, which pushes facility capex toward the $20M+/MW band even before GPUs.
MI300X delivers ~2.2× the HBM/MW of H100 — every megawatt of off-grid generation lands as twice the model-weight capacity. But MI355X actually has less HBM/MW than MI300X because TDP rose faster than HBM per chip. For pure off-grid capacity arithmetic, MI300X stays the sweet spot.
Llama 3.1 405B in FP16 needs ~810 GB of weights, so each instance needs ceil(810 / chip HBM) GPUs in tensor-parallel: 5 on MI300X versus 11 on H100. With GPUs per MW nearly equal (750W vs 700W), MI300X delivers roughly 2× the 405B instances per off-grid MW of H100; the per-GPU HBM advantage shows up as the smaller per-instance GPU count.
8-GPU node = 6 kW IT (750W per GPU): five nodes per rack, CRAH-based air cooling works.
8-GPU node = 8 kW IT (1000W per GPU): rear-door heat exchangers or direct-to-chip on hot rows.
8-GPU node = 11.2 kW IT (1400W per GPU), plus CPU + switch: every rack needs DLC manifolds.
MI300X is the cleanest off-grid AMD fit. At 750W it sits in the same TDP class as H100 (700W), so any air-cooled mining-shed conversion ($1.5–3M/MW with usable power delivery) or air-cooled inference shed ($3–6M/MW) drops MI300X in without re-engineering thermal. You get 2.2× HBM/MW and 58% more bandwidth at the same MW footprint.
MI325X needs a thermal upgrade. 1000W matches B200, so it lives in the same hybrid-cool / liquid-assist tier ($8–14M/MW). Off-grid campuses built for H100 need rear-door heat exchangers or rack-level water loops added.
MI355X is hyperscaler-only off-grid. 1400W requires direct-to-chip liquid manifolds at every rack, pushing the facility into the $18–25M/MW band. Off-grid penciling at that capex needs hyperscale-tier offtake — neocloud / mining-conversion economics break.
GPUs/MW math assumes PUE 1.2 and 80% of IT power to GPUs (rest = CPU, NIC, switch). Cooling tier bands cross-reference the four facility capex tiers in the inference economics model. Llama 3.1 405B FP16 weight size sourced from Hugging Face model card.
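The capacity arithmetic in this section can be sketched from the stated assumptions (PUE 1.2, 80% of IT power to GPUs, ~810 GB of FP16 weights). The 288 GB MI355X capacity is an assumed datasheet value, and flooring to whole GPUs and whole instances is a modeling choice made here rather than something the page specifies:

```python
import math

PUE = 1.2                      # facility overhead, from this note
GPU_SHARE_OF_IT_POWER = 0.80   # rest goes to CPU, NIC, switch
WEIGHTS_GB = 810               # Llama 3.1 405B FP16 weights

# TDP in watts and HBM in GB per GPU; MI355X HBM is an assumed datasheet value.
CHIPS = {
    "MI300X": {"tdp_w": 750, "hbm_gb": 192},
    "H100": {"tdp_w": 700, "hbm_gb": 80},
    "MI355X": {"tdp_w": 1400, "hbm_gb": 288},
}


def gpus_per_mw(chip):
    """Whole GPUs that fit in 1 MW of facility power."""
    gpu_watts = 1_000_000 / PUE * GPU_SHARE_OF_IT_POWER
    return math.floor(gpu_watts / CHIPS[chip]["tdp_w"])


def hbm_gb_per_mw(chip):
    """Total HBM capacity deployable per facility MW."""
    return gpus_per_mw(chip) * CHIPS[chip]["hbm_gb"]


def instances_per_mw(chip, weights_gb=WEIGHTS_GB):
    """Whole model instances per MW at ceil(weights / HBM) GPUs each."""
    gpus_per_instance = math.ceil(weights_gb / CHIPS[chip]["hbm_gb"])
    return gpus_per_mw(chip) // gpus_per_instance


if __name__ == "__main__":
    for chip in CHIPS:
        print(chip, gpus_per_mw(chip), hbm_gb_per_mw(chip), instances_per_mw(chip))
```

On these inputs MI300X lands at 888 GPUs per MW, which matches the 17,760 GPUs shown for a 20 MW campus in the calculator.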
Run the model on AMD
Same throughput-from-first-principles model as the standalone calculator, defaulted to MI300X 192GB + Llama 4 Maverick. Toggle to MI325X / MI355X to see the HBM headroom; toggle to H100 / H200 / B200 to compare. Every input — facility capex, electricity, batch size, duty cycle — flows through to $/M tokens and payback.
20 MW · 17,760 × MI300X 192GB · Llama 4 Maverick 400B/17B MoE (INT4) · 8,880 instances (2 GPUs each) · batch 64
Batch curve. At batch=1 you get 1,247 tok/s/instance (memory-bound). Compute-bound ceiling is 153,765 tok/s/instance. The transition (b*) sits at batch ≈ 123. You're running at batch=64, which delivers 11,146 tok/s/instance after real-world overhead (efficiency ~30% vs theoretical, calibrated to vLLM benchmarks).
Electricity is 2% of costs. GPU depreciation dominates. The electricity arbitrage helps but isn't the main driver.
Cost per million tokens: $0.0485. Selling at $0.210/M. That's a 333% markup.
Capex comparison. A 20 MW fleet costs $366.40M vs $500.00M at hyperscaler rates ($25M/MW). That's $133.60M in savings, or 27% cheaper.
Throughput. 2340.9T tokens/year at 75% duty cycle across 8,880 instances of Llama 4 Maverick 400B/17B MoE (INT4).
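The markup and capex lines in the readout are plain ratios. A short sketch that recomputes them from the displayed values (the helper names are illustrative; the calculator's internal cost build-up is not reproduced here):

```python
def markup(price_per_m, cost_per_m):
    """Markup over cost, as a fraction (3.33 means 333%)."""
    return price_per_m / cost_per_m - 1


def capex_savings(own_capex_m, reference_capex_m):
    """Return (absolute savings in $M, savings as a fraction of reference)."""
    saved = reference_capex_m - own_capex_m
    return saved, saved / reference_capex_m


if __name__ == "__main__":
    # Values shown in the readout above.
    print(f"markup: {markup(0.210, 0.0485):.0%}")

    hyperscaler_capex_m = 20 * 25.0  # 20 MW at $25M/MW
    saved, share = capex_savings(366.40, hyperscaler_capex_m)
    print(f"savings: ${saved:.2f}M ({share:.0%})")
```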
Throughput model: per-instance aggregate t/s = T_compute · b/(b+b*) · η · 1/√N, where T_compute = (GPU_TFLOPS · N) / (2·active_params), b* = T_compute / T_memory, T_memory = (GPU_BW · N) / (active_params · 0.5 bytes), η = 0.3 (real-world overhead — KV reads, sampling, dequant, framework, calibrated against vLLM Llama-70B INT4 H100 ≈ 2,200 tok/s @ b=128), N = ceil(model_vram / gpu_vram). Assumes continuous batching + INT4 quant + tensor-parallel sharding when N>1. 80% of IT power to GPUs. Facility depreciated 10y. Excludes financing, tax, land. Real production batch sizes are KV-cache-limited; observed effective batch is typically 32–128 even with max_num_seqs set higher.
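The formula above translates into a few lines of code. The sketch below is one reading of it, not the calculator's actual source. The MI300X inputs of 2,614 dense FP8 TFLOPS and 5.3 TB/s are assumptions, not values stated on this page; they were chosen because they reproduce the batch-curve figures in the readout. The 200 GB model footprint is 400B parameters at 0.5 bytes each.

```python
import math

SECONDS_PER_YEAR = 365 * 24 * 3600


def instance_throughput(gpu_tflops, gpu_bw_tbs, gpu_vram_gb, model_vram_gb,
                        active_params_b, batch, eta=0.3, bytes_per_param=0.5):
    """Per-instance aggregate tok/s under the roofline-style model above."""
    n = math.ceil(model_vram_gb / gpu_vram_gb)            # GPUs per instance
    active = active_params_b * 1e9
    t_compute = gpu_tflops * 1e12 * n / (2 * active)      # compute-bound ceiling
    t_memory = gpu_bw_tbs * 1e12 * n / (active * bytes_per_param)  # batch=1 floor
    b_star = t_compute / t_memory                         # saturation batch
    tok_s = t_compute * batch / (batch + b_star) * eta / math.sqrt(n)
    return {"n": n, "t_compute": t_compute, "t_memory": t_memory,
            "b_star": b_star, "tok_s": tok_s}


def annual_tokens(tok_s, instances, duty_cycle):
    """Fleet tokens per year at the given duty cycle."""
    return tok_s * instances * duty_cycle * SECONDS_PER_YEAR


if __name__ == "__main__":
    # Assumed MI300X inputs; Llama 4 Maverick 400B/17B MoE at INT4.
    r = instance_throughput(gpu_tflops=2614, gpu_bw_tbs=5.3, gpu_vram_gb=192,
                            model_vram_gb=200, active_params_b=17, batch=64)
    print({k: round(v, 1) for k, v in r.items()})
    fleet = annual_tokens(r["tok_s"], instances=8880, duty_cycle=0.75)
    print(f"{fleet / 1e12:,.1f}T tokens/year")
```

Because b* depends only on the ratio of compute to bandwidth, changing the chip moves the saturation batch, while changing the model's active parameter count scales both the floor and the ceiling together.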
Production deployments
Until late 2025, AMD inference at scale was a hypothesis with one major customer (Meta). The OpenAI 6 GW MI450 deal in October 2025 and the Meta 6 GW MI450 deal in February 2026 reset the trajectory — by 2027, AMD is a real second source, not a hedge.
Production inference — Llama 3, Llama 4 Maverick (>11k tok/s/node)
Bare-metal inference rental
Multi-year partnership; 160M AMD share warrant
1 GW H2 2026 ship; second mega-deal in 5 months
On-demand inference at $1.50–$2.29/GPU-hr
Run the numbers for your specific deployment. MI300X / MI325X / MI355X are now selectable GPU options in the inference economics model. The throughput physics (memory-bound floor at batch=1, compute-bound ceiling at batch=∞, saturation batch b*) derives from first principles for any of the six chips.
The off-grid pairing. If you're developing an off-grid AI site (mining-shed conversion, ERCOT BYOP, stranded-gas BTM, or a flex-load campus from the off-grid feasibility chart), MI300X is the cleanest drop-in. Same air-cooled thermal envelope as H100, 2.2× HBM/MW. MI355X requires the $18M+/MW hyperscale liquid-cooled stack — only economic with hyperscale-tier offtake.
The rental-vs-own break. At sub-$2/GPU-hr MI300X rentals (TensorWave, Hot Aisle, Vultr), the capex saving carries through to $/M tokens on the right workloads. Above $2.50/GPU-hr (hyperscaler list, RunPod), the software tax exceeds the per-hour discount. The arbitrage is real and currently sized at ~50% per GPU-hour — but it only lives at five providers, not at Azure or Oracle.