vLLM Systems · DevLab 2026, Deep Dive

I was recently invited by the Google TPU team to speak at the OpenXLA Summer DevLab 2026. This post breaks down our deep-dive evaluation of the matured vLLM + OpenXLA stack, the fundamental engineering mismatches between the CUDA and XLA serving paths, and why traditional capacity metrics are lying to you.

If you are operating large language models at enterprise scale right now, your platform architecture team is likely staring at a major infrastructure crossroads: should we migrate our core serving workloads from GPUs to TPUs?

Historically, NVIDIA's CUDA ecosystem was the only serious option for user-facing, low-latency LLM generation. But here in 2026, the economics and infrastructure options have changed. Google TPUs are highly available and cheaper per chip, and the open-source serving stack built around vLLM and OpenXLA has reached production readiness.

Yet, when our infrast...
Racecraft · Part 3 of 5

Splitting the Brain to Beat the Clock

How a "brake!" lands in 5 milliseconds while a cloud model thinks for five seconds, in the same app, on the same frame, without ever colliding.

Two posts in, we have a coach that knows who's driving and what to say. This post is about the only thing that lets it say anything useful: structure. Specifically, the decision to give the system not one brain but three, each on its own clock, with an ironclad rule about which one is allowed to make the driver wait.

I call it the Split-Brain engine, and the whole design falls out of one observation. The three jobs a coach does (react, strategize, prepare) have wildly different deadlines. Trying to serve all three from one code path means the fastest job inherits the latency of the slowest. That's the original sin of every cloud-first coaching app. So I refused to let them share a path.

The Spli...