The GPU that powers a language model matters more than most developers realize. The choice of hardware affects training time, inference latency, cost per token, and ultimately what models are economically feasible to run. In 2026, that hardware conversation is dominated by one name: NVIDIA. And the two architectures shaping the present and near future are Blackwell — already deployed at massive scale — and Vera Rubin, arriving later this year with numbers that seem almost implausible on paper.
This guide cuts through the marketing to give AI engineers what they actually need to know: specs, real-world performance, cost implications, and what it means for how you build and deploy AI systems.
NVIDIA Blackwell: The Current Standard
The Blackwell architecture, introduced in 2024 and fully deployed through 2025–2026, represents NVIDIA's most significant architectural leap since the A100. The flagship chips — B200 and GB200 — are what cloud providers and AI labs are running right now.
Blackwell Technical Specs: B200 vs GB200
Understanding the difference between B200 and GB200 matters for procurement decisions:
| Spec | B200 (single GPU) | GB200 superchip (Grace + 2× Blackwell) |
|---|---|---|
| Composition | 1 Blackwell GPU | 2 Blackwell GPUs + 1 Grace CPU |
| AI Training (FP8, with sparsity) | ~10 petaflops | ~20 petaflops (both GPUs) |
| AI Inference (FP4, with sparsity) | ~20 petaflops | ~40 petaflops (both GPUs) |
| HBM3e Memory | 192 GB | up to 384 GB GPU + 480 GB LPDDR5X |
| Memory Bandwidth | 8 TB/s | 16 TB/s GPU + ~0.5 TB/s CPU |
| TDP | 1000W | up to ~2700W (whole superchip, configurable) |
| Interconnect | NVLink 5 | NVLink 5 + NVLink-C2C (900 GB/s CPU-GPU) |
| Best For | Large-scale training clusters | Inference at scale, hybrid workloads |
The NVL72: When 72 GPUs Become One
The most powerful Blackwell deployment is not a single GPU — it is the NVL72, a rack-scale system containing 72 B200 GPUs interconnected via NVLink 5 with a combined 130 TB/s of aggregate bandwidth. This is not just a cluster of GPUs — the interconnect is fast enough that the entire NVL72 behaves as a single unified compute unit.
What this makes possible:
- Running trillion-parameter models with all model-parallel traffic kept on NVLink instead of slower inter-node networking
- Training massive mixture-of-experts models with near-linear scaling
- Serving multiple large models simultaneously with hot-swapping
- Microsecond-latency communication between all 72 GPUs
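The rack-level figures follow from simple multiplication of the per-GPU numbers quoted in this guide. The sketch below is only an arithmetic check: it assumes 72 GPUs, 1.8 TB/s of NVLink 5 bandwidth per GPU, and 192 GB of HBM3e per GPU. The constant and function names are illustrative, and vendor datasheets may quote slightly different usable capacities.

```python
# Per-GPU figures as quoted in this guide (assumptions for the arithmetic).
GPUS_PER_NVL72 = 72
NVLINK5_TBPS_PER_GPU = 1.8   # TB/s of NVLink 5 bandwidth per GPU
HBM_GB_PER_GPU = 192         # GB of HBM3e per GPU


def rack_total(per_gpu_value: float, gpus: int = GPUS_PER_NVL72) -> float:
    """Scale a per-GPU quantity up to the whole rack."""
    return per_gpu_value * gpus


aggregate_nvlink_tbps = rack_total(NVLINK5_TBPS_PER_GPU)
pooled_hbm_gb = rack_total(HBM_GB_PER_GPU)

print(f"Aggregate NVLink bandwidth: {aggregate_nvlink_tbps:.1f} TB/s")
print(f"Pooled HBM3e capacity: {pooled_hbm_gb / 1000:.1f} TB")
```

The bandwidth comes out at 129.6 TB/s, which is where the rounded 130 TB/s figure comes from.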
Who Is Actually Using NVL72?
Major cloud providers (AWS, Azure, GCP) and AI labs (OpenAI, Anthropic, Google DeepMind) are the primary NVL72 customers. The system requires direct liquid cooling and on the order of 120 kW of power per rack, so this is data center-scale hardware that most existing air-cooled colocation space cannot host without a retrofit.
FP4: Why 4-Bit Precision Changes the Economics
One of Blackwell's most consequential features is native FP4 (4-bit floating point) support in its 5th Generation Tensor Cores. To understand why this matters, you need to understand how precision affects AI workloads:
| Precision | Bits | Use Case | Nominal Throughput vs FP32 (bit-width scaling) |
|---|---|---|---|
| FP32 | 32 | Research, highest accuracy | 1× |
| BF16 | 16 | Standard training | ~2× |
| FP8 | 8 | Inference (Hopper era standard) | ~4× |
| FP4 | 4 | Inference (Blackwell native) | ~8× |
The practical result: running LLM inference on Blackwell with FP4 quantization can achieve roughly twice the throughput of running the same model on Hopper (H100) with FP8. Accuracy is usually close to the FP8 baseline when the model is quantized with calibration, but 4-bit formats can degrade quality on some tasks, so validate on your own evaluation set. For organizations spending millions on inference compute, this is the difference between feasibility and infeasibility.
Python: FP4 Inference with TensorRT-LLM

```python
import tensorrt_llm
from tensorrt_llm.quantization import QuantMode

# NOTE: the flag and parameter names below (use_fp4, precision="fp4") are kept
# as given in this article. TensorRT-LLM's quantization API changes between
# releases, so verify them against the version you have installed.

# Enable FP4 quantization for Blackwell inference
quant_mode = QuantMode.from_description(
    quantize_weights=True,
    quantize_activations=True,
    per_token=True,
    per_channel=True,
    use_fp4=True,  # Blackwell B200/GB200 native FP4
)

# Build engine with FP4
builder = tensorrt_llm.Builder()
builder_config = builder.create_builder_config(
    precision="fp4",
    quant_mode=quant_mode,
    max_batch_size=64,
    max_input_len=4096,
    max_output_len=2048,
)

# Result: ~2x throughput vs FP8 on H100
#         ~8x throughput vs FP32 baseline
```
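Lower precision also shrinks the memory needed for weights, which determines how many GPUs a model must be split across. The sketch below estimates weight memory for a 70B-parameter model at each precision from the table and compares it with the 192 GB of HBM3e on a B200. It counts weights only: KV cache, activations, and the per-block scale factors used by 4-bit formats are ignored. The helper name is illustrative.

```python
# Bits per parameter for each precision in the table above.
BITS_PER_PARAM = {"FP32": 32, "BF16": 16, "FP8": 8, "FP4": 4}

HBM_GB = 192  # B200 HBM3e capacity, as quoted in this guide


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Memory for model weights only, in GB (1 GB = 1e9 bytes)."""
    return num_params * BITS_PER_PARAM[precision] / 8 / 1e9


for precision in BITS_PER_PARAM:
    gb = weight_memory_gb(70e9, precision)
    verdict = "fits" if gb <= HBM_GB else "does not fit"
    print(f"{precision}: {gb:.0f} GB of weights, {verdict} in {HBM_GB} GB")
```

Under these assumptions the weights shrink from 280 GB at FP32 to 35 GB at FP4, which leaves most of the GPU's memory free for KV cache and larger batches.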
NVIDIA Vera Rubin: What's Coming Late 2026
While Blackwell is the present, Vera Rubin is what NVIDIA has already announced for H2 2026. The numbers are extraordinary even by NVIDIA's standards:
| Feature | Blackwell B200 | Rubin GPU (Announced) | Improvement |
|---|---|---|---|
| NVFP4 Inference | ~10 petaflops dense (~20 with sparsity) | 50 petaflops | up to ~5× (vendor figure) |
| Inference Throughput | Baseline | up to 10× higher (vendor claim) | up to 10× |
| Cost per Token | Baseline | up to 10× lower (vendor claim) | up to 10× |
| MoE Training Efficiency | Baseline | up to 4× fewer GPUs needed (vendor claim) | up to 4× |
| Memory | HBM3e | HBM4 | Higher bandwidth |
| Interconnect | NVLink 5 (1.8 TB/s) | NVLink 6 (3.6 TB/s) | 2× |
| CPU Pairing | Grace (72 Arm Neoverse V2 cores) | Vera (88 custom Olympus cores) | +22% cores |
Treat These Numbers Carefully
The 10× inference improvement claim is based on NVIDIA's own benchmarks using Mixture-of-Experts model architectures. Real-world improvements for dense transformer models will be smaller. Wait for independent benchmarks before making procurement decisions based on these figures.
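One way to reason about why a headline speedup shrinks on other workloads is Amdahl's law: if only part of the end-to-end inference time benefits from the faster path, the rest caps the overall gain. The sketch below is an illustration of that general principle, not NVIDIA's benchmarking methodology, and the accelerated fractions are hypothetical placeholders.

```python
def overall_speedup(accelerated_fraction: float, component_speedup: float) -> float:
    """Amdahl's law: end-to-end speedup when only part of the runtime gets faster."""
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be between 0 and 1")
    if component_speedup <= 0:
        raise ValueError("component_speedup must be positive")
    remaining = 1.0 - accelerated_fraction
    return 1.0 / (remaining + accelerated_fraction / component_speedup)


# Hypothetical shares of inference time that benefit from a 10x faster path.
for fraction in (0.5, 0.8, 0.95):
    gain = overall_speedup(fraction, 10.0)
    print(f"{fraction:.0%} accelerated: {gain:.2f}x overall")
```

With these placeholder fractions the overall gain is roughly 1.8×, 3.6×, and 6.9×. A workload has to spend almost all of its time in the accelerated path before the end-to-end number approaches the headline figure.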
The Vera Rubin Platform: More Than Just a GPU
Vera Rubin is not just a new GPU — it is a complete platform redesign:
- Rubin GPU: The compute engine, targeting 50 petaflops NVFP4 per chip
- Vera CPU: 88 custom Olympus cores, purpose-built for AI orchestration workloads
- NVLink 6: 3.6 TB/s per GPU — double Blackwell's already-impressive 1.8 TB/s
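- HBM4 memory: Higher bandwidth than HBM3e, which mainly helps memory-bound phases such as token-by-token decoding (how large a model fits in memory depends on capacity, not bandwidth)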
- NVLink Switch 6: Rack-scale connectivity enabling hundreds of GPUs to operate as unified compute
Should You Wait for Vera Rubin?
This is the practical question for anyone making infrastructure decisions right now:
| Situation | Recommendation |
|---|---|
| Building production inference now | Use Blackwell. Don't wait — Rubin availability will be limited at launch. |
| Planning 2027 infrastructure | Design for Rubin compatibility but build with Blackwell. |
| Training frontier models | Blackwell is the right choice for current-generation models. |
| Cost-sensitive inference at scale | Consider waiting if NVIDIA's claimed 10× cost-per-token improvement holds up in independent benchmarks. |
| Research and experimentation | Cloud access to Blackwell via AWS/Azure/GCP is immediately available. |
Beyond NVIDIA: AMD MI300X and Google TPU v5
The AI hardware market is not an NVIDIA monopoly, though NVIDIA's dominance is real:
- AMD MI300X: Competitive on memory capacity (192 GB HBM3 in a single chip), increasingly supported by PyTorch/ROCm. Best for memory-bound workloads. Still lags on software ecosystem maturity.
- Google TPU v5e/v5p: Highly optimized for Google's own JAX/XLA stack. Cost-competitive on Google Cloud. Limited portability if you want to run code elsewhere.
- Cerebras CS-3: Wafer-scale chip with massive on-chip SRAM. Extraordinary for specific research workloads, but a niche use case.
- Groq LPU: Ultra-low latency inference chip. Excellent for real-time inference applications where token latency matters more than throughput.
Practical Advice for AI Engineers
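For most teams, the decision is not which GPU to buy but which cloud provider to use. Check the instance family carefully: AWS P5 (H100/H200), Azure ND H200 v5, and GCP A3 Ultra (H200) are Hopper-generation, while Blackwell is offered through newer families such as AWS P6 (B200), Azure ND GB200 v6, and GCP A4 (B200). Availability varies by region, so confirm against each provider's current catalog. Start with spot/preemptible instances to reduce cost during experimentation before committing to reserved capacity.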
What Blackwell Means for AI Inference Costs
The economics of AI have been shifting rapidly as hardware improves. Here is what Blackwell's FP4 throughput means in practical terms for LLM inference costs compared to the H100 era:
- GPT-5.5 class models: Running inference on Blackwell B200 clusters costs roughly 40–50% less per token than equivalent H100 infrastructure at the same scale.
- 70B parameter models: Can now be served on a single GB200 with comfortable headroom, eliminating tensor parallelism overhead.
- Mixture-of-Experts models: Blackwell's NVLink bandwidth makes MoE inference dramatically more efficient — the biggest beneficiary of the architecture.
- Batch inference: High-throughput batch jobs see the largest cost reductions — up to 60% cheaper per token vs. H100 in optimal configurations.
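To turn percentages like these into a budget, it helps to compute cost per million tokens from an hourly GPU price and a sustained throughput. The sketch below shows that arithmetic. Every price and throughput value in it is a hypothetical placeholder chosen to keep the numbers easy to follow, not a measured or quoted figure, so substitute your own benchmark results and provider pricing.

```python
def cost_per_million_tokens(gpu_hour_usd: float, tokens_per_second: float) -> float:
    """USD to generate one million tokens on one GPU at a steady throughput."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hour_usd / tokens_per_hour * 1_000_000


# Hypothetical placeholder inputs: replace with measured throughput and real pricing.
h100_cost = cost_per_million_tokens(gpu_hour_usd=3.0, tokens_per_second=2_500)
b200_cost = cost_per_million_tokens(gpu_hour_usd=6.0, tokens_per_second=10_000)
saving = 1 - b200_cost / h100_cost

print(f"H100: ${h100_cost:.3f} per million tokens")
print(f"B200: ${b200_cost:.3f} per million tokens")
print(f"Saving: {saving:.0%}")
```

The point of the exercise is that a GPU costing twice as much per hour still halves the cost per token if it sustains four times the throughput, so hourly price alone is a poor basis for comparison.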
Conclusion: Hardware Is Competitive Advantage
In the early years of the LLM era, the competitive advantage in AI was largely about models: who had the best architecture, the most data, the smartest researchers. In 2026, as model architectures converge and training techniques become widely understood, hardware efficiency is increasingly the differentiator. Teams that understand how to extract maximum performance from Blackwell GPUs (through FP4 quantization, optimal batch sizes, and NVLink topology-aware parallelism) have a real advantage over teams that treat hardware as a commodity. And when Vera Rubin arrives, the teams who have invested in understanding the hardware layer will be the first to capture whatever share of its claimed 10× inference efficiency gains holds up in practice.