
Posts

My Partner's MRI Didn't Come with a Manual, So I Built One with AI

Gemma 4 Good Hackathon · Impact Track · Health & Sciences

It started with two envelopes. One contained a single sheet of paper, a radiologist's report for my partner. It was a wall of text that might as well have been written in another language. Words like "parenchymal volume," "hyperintensities," and "susceptibility artifact" stared back at us, creating more anxiety than they resolved. The other was a flimsy paper sleeve containing a CD-ROM. This, we were told, held the actual images from her MRI scan. The ground truth. And we couldn't even look at it. Our laptops, like most these days, don't have disc drives. For a moment, this crucial, deeply personal piece of her health information was a coaster. I felt that familiar, hot-wired frustration every engineer knows: the feeling of being locked out by a dumb problem. The powerlessness was infuriating. So, I did what any slightly obsessive software engineer would d...
Recent posts

The Hidden Performance Trap in Causal Ring Attention

When I set out to implement Ring Attention for long-context models, I hit a wall that none of the papers prepared me for. My performance was capping out at half of what it should be, no matter how many accelerators I threw at the problem. It turns out, there's a hidden performance trap in the standard recipe for causal attention. Here’s the story of that bug, how to prove it exists without burning a single TPU hour, and how a simple trick called "zigzag sharding" fixed it completely. This post is a walkthrough of ring-flash-jax, a small JAX project that explores this problem and implements the fix. We'll add the four things you actually need before Ring Attention is usable for training a real-world causal language model.

TL;DR: ring-flash-jax is a JAX implementation of ring attention with four key additions to the standard pattern: Causal masking: required for any decoder language model. Zigzag (striped) sharding: fixes the critical load...
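The "capping out at half" symptom can be reproduced on a laptop with a toy cost model. What follows is a minimal sketch and not code from ring-flash-jax: the function names, the chunk-level cost model, and the default `chunk_len` are all assumptions made for illustration. It assumes each device skips fully masked blocks, that every ring step lasts as long as the busiest device, and that communication is free.

```python
def chunk_pairs(q_chunk, k_chunk, chunk_len):
    """Unmasked (query, key) pairs between two chunks under a causal mask."""
    if k_chunk < q_chunk:
        return chunk_len * chunk_len            # key chunk fully visible
    if k_chunk == q_chunk:
        return chunk_len * (chunk_len + 1) // 2  # lower triangle incl. diagonal
    return 0                                     # key chunk entirely in the future


def contiguous_layout(n_devices):
    """Device d holds two neighbouring chunks: 2d and 2d + 1."""
    return [[2 * d, 2 * d + 1] for d in range(n_devices)]


def zigzag_layout(n_devices):
    """Device d holds one early chunk and its mirror-image late chunk."""
    return [[d, 2 * n_devices - 1 - d] for d in range(n_devices)]


def ring_efficiency(layout, chunk_len=128):
    """Useful work divided by (devices x wall time) for one ring pass.

    Hypothetical cost model: at ring step s, device d attends its own
    queries to the keys that started on device (d - s) mod n, and the
    step takes as long as the most heavily loaded device.
    """
    n = len(layout)
    useful = 0
    wall = 0
    for step in range(n):
        per_device = []
        for d in range(n):
            src = (d - step) % n
            work = sum(
                chunk_pairs(q, k, chunk_len)
                for q in layout[d]
                for k in layout[src]
            )
            per_device.append(work)
        useful += sum(per_device)
        wall += max(per_device)
    return useful / (n * wall)


if __name__ == "__main__":
    for n in (2, 4, 8, 16):
        print(n, ring_efficiency(contiguous_layout(n)), ring_efficiency(zigzag_layout(n)))
```

Under these assumptions the contiguous layout approaches 50% efficiency as devices are added, because at every ring step at least one device still has a fully unmasked block while others sit idle. The zigzag layout gives every device the same amount of unmasked work at every step, so the model reports full efficiency.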

Unboxing Gemma: High-Throughput Sparse Autoencoder (SAE) Training on Google Cloud TPU v5e

The Curse of Superposition: Why LLMs are Black Boxes

Large Language Models like Google’s Gemma are incredibly powerful, but they suffer from a phenomenon known as Superposition. Neural networks naturally want to represent more concepts than they have mathematical dimensions. To accomplish this, they pack multiple unrelated concepts into the same neurons, a property called polysemanticity. When Gemma processes the word "Paris," it doesn't activate a neat, dedicated "City" neuron. Instead, it fires a dense, entangled vector of floating-point numbers in a 2,304-dimensional space that simultaneously represents "France," "capital," "tourism," and "linguistics." For researchers trying to build safer, steerable AI, this is a massive problem. How do we debug an AI if its internal thoughts are entangled in a dense manifold? The state-of-the-art solution is Mechanistic Interpretability via Sparse Autoencoders (SAEs...

The Gemma 4 E2B Fine-Tuning Cookbook

A complete, opinionated recipe for adapting Gemma 4 E2B to your domain, from multimodal dataset construction and QLoRA configuration through training loop debugging, evaluation, and production deployment.

April 2026 · ~28 min read · HuggingFace / TRL / PEFT

Recipe at a Glance

Serves: 1 fine-tuned model

Ingredients (hardware)
NVIDIA GPU: ≥ 24 GB VRAM
System RAM: ≥ 32 GB
Storage (SSD): ≥ 50 GB free
Training dataset: 1K–100K samples
Python: ≥ 3.11
CUDA: ≥ 12.1

Ingredients (libraries)
transformers: ≥ 4.51
peft: ≥ 0.12
trl: ≥ 0.12
bitsandbytes: ≥ 0.44
accelerate: ≥ 0.34
wandb / mlflow: any

Contents
01. Why Fine-Tune E2B?
02. Dataset Construction
03. QLoRA Configuration
04. Multimodal Fine-Tuning
05. The Training Loop
06. Debugging ...

Building a Multimodal Document Intelligence Pipeline with Gemma 4 E2B

A production-grade walkthrough: PDF ingestion, OCR, audio transcription, structured reasoning, and agentic tool calling, all on a 2.3B parameter model that fits on a single consumer GPU.

Gemma 4 E2B | Vision · OCR | Audio · ASR | Function Calling | 128K Context

April 2026 · ~22 min read · Google DeepMind / HuggingFace Transformers

Contents
Model Overview & Key Architecture Choices
Environment Setup & Loading the Model
The Document Intelligence Pipeline
Vision Pipeline: OCR & Document Parsing
Audio Pipeline: ASR & Translation
Configurable Thinking Mode
Agentic Tool Calling
Production Serving with vLLM
Benchmark Results

01. Model Overview

Why Gemma 4 E2B Changes On-Device AI

The Gemma 4 family is a fundamental rethinking of what a small model can do. The E2B variant ("2B effective parameters") achieves 2.3B active parameters while c...