vLLM Systems · DevLab 2026, Deep Dive
I was recently invited by the Google TPU team to speak at the OpenXLA Summer DevLab 2026. This post breaks down our deep-dive evaluation of the matured vLLM + OpenXLA stack, the fundamental engineering mismatches between CUDA and XLA serving paths, and why traditional capacity metrics are lying to you.
If you are operating large language models at enterprise scale right now, your platform architecture team is likely staring at a massive infrastructure crossroads: Should we migrate our core serving workloads from GPUs to TPUs?
Historically, NVIDIA's CUDA ecosystem was the only serious option for user-facing, low-latency LLM generation. But here in 2026, the economics and infrastructure options have changed. Google TPUs are widely available and cheaper per chip, and the open-source serving stack built around vLLM and OpenXLA has matured to the point of production readiness.
Yet, when our infrastructure team at PayPal AI Lab sat down to model this migration responsibly, we hit a frustrating wall. Virtually every public vendor benchmark is a mirage. They claim pristine performance margins, but they achieve them by subtly altering the testing environment. They swap the model architectures, tweak precision layouts, discard cold-start latency numbers, or use highly artificial client behaviors that mask massive production tail-latency variations.
We could not justify a multi-million dollar infrastructure shift on numbers that were never measured the same way. So, we built what we wished existed: an open-source, single-variable benchmark harness. We held the weights, the request mixes, the client semantics, and the target service-level objectives (SLOs) completely fixed. The only variable allowed to change was the backend server itself.
"You cannot crown an infrastructure winner using peak throughput metrics that were measured over dynamic requests while hiding the compilation wall."
The Runtime Collision: Dynamic CUDA vs. Static OpenXLA
To understand why identical model weights behave entirely differently on a GPU versus a TPU, you have to look past the hardware specifications and examine the compilation layer. It is a direct structural conflict between a runtime engineered for dynamic flexibility and a compiler engineered for hyper-optimized static matrix operations.
The GPU + CUDA serving path is inherently dynamic. Developed and hand-tuned alongside real-world chat serving for a decade, its memory allocation frameworks and execution kernels scale on-the-fly to handle text sequences of arbitrary lengths. Container startup is nearly instant; the weights load into high-bandwidth memory (HBM), the execution engine instantiates, and the first request is processed with zero ahead-of-time compilation delay.
The TPU + OpenXLA serving path is static-first. The underlying TPU architecture is built around a systolic array, a high-density compute grid designed specifically to stream dense matrix multiplications. To keep that grid fully utilized, the XLA compiler generates rigid machine code specialized for fixed tensor shapes ahead of execution.
This creates a severe impedance mismatch with live production serving. Real-world inference traffic is chaotic and ragged: prompts have unique lengths, outputs are highly variable, and flash-traffic arrivals are deeply bursty. To reconcile this, the XLA pipeline maps incoming requests into fixed, precompiled padding buckets (typically scaled across bounds like 128, 256, 512, or 1024 tokens). This architectural reality introduces two immediate operational side effects:
- Compute Padding: If an incoming customer prompt is 130 tokens long, it is padded up to the 256-token execution bucket. The hardware executes dummy operations over that empty padding space, resulting in wasted compute overhead.
- The Compilation Wall: The first time the engine initializes, or encounters a shape configuration it has not compiled before, it triggers a compilation pass that can freeze the server for several minutes to an hour.
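To make the padding cost concrete, here is a minimal sketch of bucket selection. The bucket ladder is the one quoted above; the function names are hypothetical and are not taken from vLLM or tpu-inference.

```python
from bisect import bisect_left

# Bucket bounds quoted in the text; a real deployment may use a different ladder.
BUCKETS = (128, 256, 512, 1024)

def pick_bucket(prompt_len, buckets=BUCKETS):
    """Return the smallest bucket that can hold prompt_len tokens."""
    i = bisect_left(buckets, prompt_len)
    if i == len(buckets):
        raise ValueError(
            f"prompt of {prompt_len} tokens exceeds largest bucket {buckets[-1]}"
        )
    return buckets[i]

def padding_waste(prompt_len, buckets=BUCKETS):
    """Fraction of the chosen bucket that is padding rather than real tokens."""
    bucket = pick_bucket(prompt_len, buckets)
    return (bucket - prompt_len) / bucket

if __name__ == "__main__":
    for n in (128, 130, 600):
        print(n, pick_bucket(n), f"{padding_waste(n):.1%}")
```

For the 130-token prompt in the example above, 126 of the 256 slots are padding, so roughly 49% of that bucket's work is spent on dummy tokens.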
Redefining Capacity: Raw Throughput vs. Production Goodput
Standard engineering evaluations almost exclusively track Peak Throughput (tokens per second). However, in live production environments, peak throughput is an incredibly deceptive metric. An LLM request undergoes two fundamentally separate operational phases, each hitting completely different limits on the processor chip:
1. The Prefill Phase: This phase processes the entire incoming prompt in parallel. It is heavily compute-bound and dictates your Time to First Token (TTFT)—the primary metric governing perceived application responsiveness.
2. The Decode Phase: This phase generates one token at a time, sequentially. Because it must fetch the model's full parameter weights from memory for every token generated, it is highly memory-bandwidth-bound. This phase dictates your Time Per Output Token (TPOT), the metric governing your streaming user experience.
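The bandwidth-bound nature of decode can be approximated with a back-of-envelope floor: one decode step cannot finish faster than the time it takes to read the weights once. The sketch below assumes a batch of one and ignores KV-cache reads and kernel overhead. The bandwidth figure is a hypothetical placeholder, not a measured number from our rig.

```python
def decode_tpot_floor_ms(num_params, bytes_per_param, mem_bandwidth_bytes_per_s):
    """Lower bound on time per decode step: read every weight once from memory."""
    weight_bytes = num_params * bytes_per_param
    return weight_bytes / mem_bandwidth_bytes_per_s * 1000.0

if __name__ == "__main__":
    # Hypothetical accelerator with 1 TB/s of memory bandwidth (assumption).
    HYPOTHETICAL_BANDWIDTH = 1.0e12
    # Nominal 8B parameters in bfloat16 (2 bytes each).
    floor_ms = decode_tpot_floor_ms(8e9, 2, HYPOTHETICAL_BANDWIDTH)
    print(f"TPOT floor: {floor_ms:.1f} ms")
```

Under those assumptions an 8B bfloat16 model has a floor of about 16 ms per step. Doubling the compute throughput of the chip would not move it, which is why decode is treated separately from prefill.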
Because these phases hit entirely separate hardware walls, our testing framework evaluates capacity by prioritizing Goodput, defined as the number of requests per second that meet both our TTFT and TPOT latency targets simultaneously. Raw throughput counts every token, meaning a server can look incredibly healthy on paper while delivering lagging, effectively unreadable text streams to your users.
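The difference between the two metrics is easy to see in code. This is a minimal sketch, not the harness's actual implementation: `RequestResult` and `summarize` are hypothetical names, the sample data is synthetic, and the default thresholds mirror the SLO targets used in this post (TTFT ≤ 1000 ms, TPOT ≤ 50 ms).

```python
from dataclasses import dataclass

@dataclass
class RequestResult:
    ttft_ms: float      # time to first token
    tpot_ms: float      # mean time per output token
    output_tokens: int  # tokens generated for this request

def summarize(results, window_s, ttft_slo_ms=1000.0, tpot_slo_ms=50.0):
    """Compute raw throughput and SLO-filtered goodput over one measurement window."""
    good = [
        r for r in results
        if r.ttft_ms <= ttft_slo_ms and r.tpot_ms <= tpot_slo_ms
    ]
    return {
        # Every token counts, whether or not the request met its SLO.
        "throughput_tok_s": sum(r.output_tokens for r in results) / window_s,
        # Only requests meeting BOTH targets count.
        "goodput_req_s": len(good) / window_s,
        "slo_attainment": len(good) / len(results) if results else 0.0,
    }

if __name__ == "__main__":
    # Synthetic data: one healthy request, one slow prefill, one slow decode.
    sample = [
        RequestResult(200.0, 30.0, 100),
        RequestResult(1500.0, 30.0, 100),
        RequestResult(200.0, 120.0, 100),
    ]
    print(summarize(sample, window_s=1.0))
```

In the synthetic sample all three requests contribute equally to throughput, but only one of them counts toward goodput.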
The Experimental Framework
To establish a clean baseline, we configured our hardware testbed to run our exact, non-vendor-optimized production environment:
| Layer | Exact Stack Specification |
|---|---|
| Serving Engine | vLLM 0.13.0 (Native PyTorch / CUDA path) |
| Frameworks | PyTorch 2.9.0+cu128 · Transformers 4.57.6 · NumPy 2.1.3 |
| Drivers | CUDA 13.0 · NVIDIA Kernel Driver 580.95.05 |
| Hardware Rig | 8× NVIDIA RTX PRO 6000 Blackwell (97,887 MiB VRAM per GPU) |
| Topology | PCIe Gen5 baseboard connectivity · No physical NVLink bridge installed |
| Evaluation Models | Llama-3.1 8B (TP=1) & Llama-3.1 70B (TP=8) executed in bfloat16 |
We drove real-world operational demand using a bursty Poisson arrival distribution pattern across a sequence ladder, evaluating against standard enterprise service-level metrics: a hard compliance ceiling of TTFT ≤ 1000 ms and TPOT ≤ 50 ms.
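For readers who want to reproduce the arrival pattern, a Poisson process can be generated by drawing exponentially distributed gaps between requests. This is a minimal sketch using only the standard library; the function name and seed handling are illustrative and not necessarily how the benchmark client does it.

```python
import random

def poisson_arrivals(rate_rps, duration_s, seed=0):
    """Return sorted arrival timestamps (seconds) for a Poisson process."""
    rng = random.Random(seed)
    t = 0.0
    arrivals = []
    while True:
        # Inter-arrival gaps of a Poisson process are exponentially distributed.
        t += rng.expovariate(rate_rps)
        if t >= duration_s:
            return arrivals
        arrivals.append(t)

if __name__ == "__main__":
    times = poisson_arrivals(rate_rps=32, duration_s=10, seed=42)
    gaps = [b - a for a, b in zip(times, times[1:])]
    print(len(times), "requests; shortest gap", min(gaps), "longest gap", max(gaps))
```

Even at a fixed average rate, the gaps vary widely, so short clusters of near-simultaneous requests appear naturally. A client that sends requests at perfectly even intervals never exercises that behavior.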
Empirical Finding 1: Exposing the Throughput Mirage
When you chart raw token production under increasing request concurrency, our Blackwell hardware baseline delivers an exceptionally clean performance profile.
As the offered load scales up to 32 requests per second, raw throughput climbs predictably. The single-GPU Llama 8B deployment peaks at ~5,131 tokens per second, while our 8-GPU tensor-parallel 70B deployment reaches ~2,187 tokens per second.
If you evaluated your systems using this chart alone, your infrastructure team would celebrate a successful deployment. But raw token throughput is an undifferentiated calculation. It counts tokens from requests that are actively meeting your response thresholds right alongside tokens that are stalling out and failing your users' latency expectations.
Empirical Finding 2: Interconnect Limits & Tail Latency Failures
The operational picture changes dramatically when we unpack the P99 Tail Latency curves against our strict production metrics.
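As a reminder of why we look at P99 rather than the mean, here is a small nearest-rank percentile sketch. The latency values are synthetic and chosen only to illustrate the effect; they are not measurements from our runs.

```python
import math

def percentile_nearest_rank(samples, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

if __name__ == "__main__":
    # Synthetic data: 98 fast tokens and 2 badly stalled ones.
    latencies_ms = [30.0] * 98 + [400.0] * 2
    mean_ms = sum(latencies_ms) / len(latencies_ms)
    print("mean:", mean_ms)
    print("p50:", percentile_nearest_rank(latencies_ms, 50))
    print("p99:", percentile_nearest_rank(latencies_ms, 99))
```

In this synthetic set the mean and median both sit comfortably under a 50 ms target, while the P99 lands on the stalled samples. That gap is what the tail-latency curves are designed to expose.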
As illustrated in Figure 2, the single-GPU Llama 3.1 8B configuration handles our scaling demand with ease, keeping P99 latency at a comfortable 38 ms.
However, the 70B model's performance limits reveal a significant bottleneck. Because a 70B-parameter footprint in bfloat16 exceeds the memory capacity of a single card, the weights must be partitioned across GPUs using tensor parallelism. This layout requires continuous, real-time synchronization between the chips, in the form of all-reduce collectives, for every hidden layer in the network.
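The capacity argument is simple arithmetic. The sketch below uses nominal parameter counts (8B and 70B, an approximation of the real checkpoints) and the per-GPU VRAM figure from the table above. It counts weights only and ignores KV cache and activations, which make the real picture tighter.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3

def weight_bytes(num_params, bytes_per_param=2):
    """Bytes needed for the weights alone; bfloat16 is 2 bytes per parameter."""
    return num_params * bytes_per_param

def fits_on_one_gpu(num_params, vram_mib, bytes_per_param=2):
    """True if the weights alone fit in a single GPU's memory."""
    return weight_bytes(num_params, bytes_per_param) <= vram_mib * MIB

if __name__ == "__main__":
    VRAM_MIB = 97_887  # per-GPU figure from the stack table
    for name, params in (("8B", 8_000_000_000), ("70B", 70_000_000_000)):
        gib = weight_bytes(params) / GIB
        print(name, f"{gib:.1f} GiB of weights, fits:", fits_on_one_gpu(params, VRAM_MIB))
    # With TP=8 the weights are split roughly evenly across the eight cards.
    print("70B per-GPU shard:", f"{weight_bytes(70_000_000_000) / 8 / GIB:.1f} GiB")
```

At 2 bytes per parameter the 70B weights need about 140 GB, against roughly 102.6 GB on one card, so sharding is forced before a single request has been served.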
Because our baseline rig connects its GPUs over standard PCIe Gen5 lanes with no NVLink bridge, the inter-GPU interconnect quickly saturates under load. The time required to pass tensors between cards pushes the P99 decode tail latency to an unusable ~466 ms per token. The hardware continues to emit large volumes of tokens, but each individual stream has effectively stalled.
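A rough way to see why per-layer collectives dominate decode is to count them. The sketch below assumes 80 decoder layers for Llama-3.1 70B and two all-reduces per layer, which is the usual pattern for Megatron-style tensor parallelism. Both are assumptions for illustration, and the per-collective latency is a hypothetical input rather than a number we measured.

```python
def collective_overhead_ms(num_layers, allreduces_per_layer, latency_us_per_allreduce):
    """Total time per decode step spent waiting on all-reduce collectives."""
    return num_layers * allreduces_per_layer * latency_us_per_allreduce / 1000.0

def max_latency_us_for_budget(num_layers, allreduces_per_layer, tpot_budget_ms):
    """Average per-collective latency that would consume the whole TPOT budget."""
    return tpot_budget_ms * 1000.0 / (num_layers * allreduces_per_layer)

if __name__ == "__main__":
    LAYERS = 80          # assumed layer count for Llama-3.1 70B
    PER_LAYER = 2        # assumed: one after attention, one after the MLP
    print("collectives per token:", LAYERS * PER_LAYER)
    print("budget per collective (us):", max_latency_us_for_budget(LAYERS, PER_LAYER, 50.0))
    # Hypothetical 100 us per all-reduce on an uncontended link.
    print("overhead at 100 us (ms):", collective_overhead_ms(LAYERS, PER_LAYER, 100.0))
```

Under these assumptions each token requires 160 collectives, so a 50 ms TPOT target leaves at most 312.5 µs per all-reduce even if compute were free. Any contention on the PCIe path is multiplied 160 times per token.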
The Usable Goodput Collapse
When we apply our operational filter and count only the volume of requests that successfully meet our latency criteria, the architectural impact of the interconnect bottleneck becomes starkly apparent.
The results in Figure 3 demonstrate the stark reality of the throughput trap. While our Llama 8B instance scales linearly and safely across the entire load spectrum, our multi-accelerator 70B configuration experiences a severe goodput collapse under high traffic density. Usable capacity drops sharply from a peak efficiency of 9.3 good requests per second down to a minimal 2.4 good requests per second.
This is where standard vendor benchmarks fail you. If your capacity management team only monitors global token-generation rates, this catastrophic degradation would look like perfectly healthy saturation. In reality, the server is burning hardware cycles producing text that falls completely outside your compliance limits.
The Hidden Variable: The OpenXLA Cold Compilation Wall
There is another critical operational hurdle that steady-state evaluations completely omit: the cold initialization penalty.
When you trigger an autoscaling event or deploy a container revision across an NVIDIA GPU cluster, startup involves no shape-specific compilation step. Once weight tensors are resident in VRAM, the engine begins immediate, low-latency execution. The first request on a cold container returns tokens with nearly the same responsiveness as a warmed node.
Because OpenXLA relies heavily on static compilation, it handles initialization entirely differently. If your target sequence shapes or model configuration are not already stored in a local persistent HLO cache, the very first incoming request initiates a deep compilation pass. This process freezes execution behind a 20 to 30-minute compilation wall, and even a pre-compiled cache hit requires roughly 5 minutes of startup overhead.
For horizontal infrastructure scale-out or live system failover handling, this wide variance introduces significant operational risks. If a sudden traffic spike hits your service, you cannot scale out new TPU instances on-demand; they will sit unresponsive on the compilation pass while your existing cluster buckles under load.
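The operational cost of a slow start can be estimated as the backlog that accumulates while a new replica is still unavailable. This is a deliberately simple sketch: the startup times are the ones quoted above, while the arrival and capacity rates are hypothetical values chosen for illustration.

```python
def backlog_during_startup(arrival_rps, capacity_rps, startup_s):
    """Requests that queue up (or get shed) while a new replica is still starting."""
    excess_rps = max(0.0, arrival_rps - capacity_rps)
    return excess_rps * startup_s

if __name__ == "__main__":
    ARRIVAL_RPS = 12.0   # hypothetical spike
    CAPACITY_RPS = 10.0  # hypothetical capacity of the existing fleet
    scenarios = {
        "cache hit (~5 min)": 5 * 60,
        "cold compile (20 min)": 20 * 60,
        "cold compile (30 min)": 30 * 60,
    }
    for label, startup_s in scenarios.items():
        print(label, backlog_during_startup(ARRIVAL_RPS, CAPACITY_RPS, startup_s))
```

With only 2 requests per second of excess demand, a 5-minute warm start leaves 600 requests waiting and a 30-minute cold compile leaves 3,600. This is why startup time belongs in the capacity model alongside steady-state goodput.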
Under the Hood: The 2026 Unified vLLM-XLA Stack
While these runtime characteristics present real engineering challenges, the open-source software ecosystem has made massive leaps over the last year to mitigate them. Until recently, deploying on Google Cloud infrastructure required navigating a fragmented tooling landscape: managing Google's internal JetStream repository, using experimental community vLLM-TPU forks, or tracking the primary vLLM-CUDA development path.
In February 2026, the community executed a comprehensive consolidation pass. JetStream was officially archived and folded directly into a high-performance, JAX-native execution engine named tpu-inference. This refactoring delivers a remarkably clean, streamlined architecture:
- Unified Control Layer: vLLM functions as the centralized front-end orchestrator—managing the continuous token scheduler, routing requests, and providing an OpenAI-compliant API wrapper.
- JAX-Native Lowering: If your model files are optimized for JAX, they route directly to high-performance loops in tpu-inference. Standard PyTorch code leverages Torchax to catch live execution graphs on-the-fly, mapping them to portable StableHLO structures while bypassing slow, eager Python execution layers.
- Ragged Paged Attention v3 (RPA v3): To solve the compute overhead of traditional static padding buckets, the stack now relies on custom hardware kernels written in Google's low-level Pallas and Mosaic layout engines. RPA v3 supports dynamic ragged tiling inside HBM memory spaces and directly fuses KV-cache scatter steps into the primary attention block. This optimization achieves up to 86% Model Bandwidth Utilization (MBU) during streaming generation loops.
The Next Frontier: Sparse Mixture of Experts (MoE)
Our current benchmarking harness evaluates standard dense transformer architectures like Llama. However, enterprise production deployments are shifting rapidly toward sparse Mixture of Experts (MoE) models such as Mixtral.
MoE workloads introduce an entirely new layer of runtime complexity. In each MoE layer, a router scores every incoming token and dynamically dispatches it to a small subset of specialized expert sub-networks, which may live on different chips across the cluster. This gather, expert computation, and scatter sequence is highly irregular, data-dependent, and unpredictable.
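The irregularity is visible even in a toy router. The sketch below uses random scores as a stand-in for learned router logits, with 8 experts and top-2 selection to mirror Mixtral 8x7B's published configuration. It illustrates the load-imbalance problem only and is not a model of any real routing kernel.

```python
import random
from collections import Counter

def route_tokens(num_tokens, num_experts, top_k, seed=0):
    """Assign each token to its top_k highest-scoring experts; return per-expert load."""
    rng = random.Random(seed)
    load = Counter()
    for _ in range(num_tokens):
        # Random scores stand in for the router's learned logits (assumption).
        scores = [rng.gauss(0.0, 1.0) for _ in range(num_experts)]
        ranked = sorted(range(num_experts), key=scores.__getitem__, reverse=True)
        load.update(ranked[:top_k])
    return [load[e] for e in range(num_experts)]

if __name__ == "__main__":
    load = route_tokens(num_tokens=64, num_experts=8, top_k=2, seed=7)
    mean_load = sum(load) / len(load)
    print("tokens per expert:", load)
    print("max / mean load:", max(load) / mean_load)
```

Each batch produces a different per-expert token count, so the matrix shapes handed to each expert change from step to step. That is exactly the kind of shape variability a static, bucketed compiler has to pad around.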
NVIDIA's streaming multiprocessor engines handle these unpredictable routing paths exceptionally well due to their dynamic scheduling flexibility and highly optimized MoE custom kernels. Conversely, a TPU's systolic array is optimized for rigid, predictable matrix shapes, making data-dependent load balancing across the execution grid significantly more challenging. While initial vendor reports for TPU Trillium chips show strong performance, evaluating sparse models under real-world, bursty traffic distributions will be the next major engineering battleground.
The Production Playbook
The ultimate takeaway from our research isn't a simple conclusion that "GPU wins" or "TPU wins". The real lesson is that infrastructure optimization requires a strict, fair testing methodology. If your systems organization is actively auditing alternative serving platforms, you must mandate a rigorous evaluation framework:
- Standardize the Client Harness: Run your evaluation using a single, isolated load generator enforcing matching API structures and identical metric calculation code.
- Define Target SLO Compliance: Establish strict latency thresholds tailored to your product constraints and evaluate your options entirely based on Usable Goodput.
- Capture the Full Deployment Lifecycle: Explicitly clear local compilation caches before evaluating system startup times to measure the operational impact on your elastic scaling boundaries.
- Map Your Exact Cluster Interconnects: If your larger, multi-accelerator models fall back to basic PCIe lanes rather than high-performance fabric paths like NVLink or TPU ICI meshes, your tail latency will inevitably degrade under production loads. Evaluate at the exact scale of your production deployment topology.
If you want to review the exact deployment patterns, the complete test automation suite, or run baseline sweeps on your own hardware, the full benchmark environment code is open-sourced at github.com/rabimba/vllm-xla-bench. For those interested in tracking the presentation decks directly, you can review the original talk presentation slide PDF.
Note: This blogpost will be updated with the deep-dive video session link as soon as Google shares the recorded presentation stream on Google's YouTube channel. Stay tuned.
Open-Source Benchmarking Artifacts & Contacts:
• Implementation Source Code: github.com/rabimba/vllm-xla-bench
• Presentation Slides Material: devlab-vllm-xla_talk.pdf
• Hosted DevLab Event Details: openxla.org/events/summer_devlab_2026
• Professional Connection: linkedin.com/in/rabimba · rkaranjai@paypal.com
