RK's Rambling

Posts

The Hidden Performance Trap in Causal Ring Attention

When I set out to implement Ring Attention for long-context models, I hit a wall that none of the papers prepared me for. My performance was capping out at half of what it should be, no matter how many accelerators I threw at the problem. It turns out there's a hidden performance trap in the standard recipe for causal attention. Here's the story of that bug, how to prove it exists without burning a single TPU hour, and how a simple trick called "zigzag sharding" fixed it completely.

This post is a walkthrough of ring-flash-jax, a small JAX project that explores this problem and implements the fix. We'll add the four things you actually need before Ring Attention is usable for training a real-world causal language model.

TL;DR

ring-flash-jax is a JAX implementation of ring attention with four key additions to the standard pattern:

- Causal masking: required for any decoder language model.
- Zigzag (striped) sharding: fixes the critical load...
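The load imbalance this excerpt describes can be illustrated with a toy cost model in plain Python. This is a sketch under stated assumptions, not code from ring-flash-jax: the function names are hypothetical, the sequence is cut into 2N equal chunks for N devices, one unit of work is one (query chunk, key chunk) pair that the causal mask does not fully discard, and the zigzag layout is assumed to be the commonly described one where device i holds chunks i and 2N-1-i. Every ring step is charged at the cost of its slowest device, since all devices must finish before the next key/value rotation.

```python
def contiguous_chunks(device, n_devices):
    # Assumed baseline layout: each device owns two adjacent chunks.
    return [2 * device, 2 * device + 1]


def zigzag_chunks(device, n_devices):
    # Assumed zigzag layout: one early chunk paired with its mirror late chunk.
    return [device, 2 * n_devices - 1 - device]


def step_work(q_chunks, k_chunks):
    # Count (query, key) chunk pairs that survive the causal mask (key <= query).
    # Diagonal pairs are charged as a full unit for simplicity.
    return sum(1 for q in q_chunks for k in k_chunks if k <= q)


def ring_cost(layout, n_devices):
    # At ring step s, device d holds the keys/values that started on device d - s.
    # The step takes as long as its busiest device.
    total = 0
    for step in range(n_devices):
        per_device = [
            step_work(layout(d, n_devices), layout((d - step) % n_devices, n_devices))
            for d in range(n_devices)
        ]
        total += max(per_device)
    return total


def ideal_cost(n_devices):
    # Total causal work split perfectly evenly across devices.
    n_chunks = 2 * n_devices
    return n_chunks * (n_chunks + 1) // 2 // n_devices


if __name__ == "__main__":
    for n in (4, 8):
        print(n, ring_cost(contiguous_chunks, n), ring_cost(zigzag_chunks, n), ideal_cost(n))
```

Under this model, with 4 devices the contiguous layout costs 15 units against an ideal of 9, while the zigzag layout reaches 9. As the device count grows, the contiguous cost approaches twice the ideal, which is consistent with throughput capping out at roughly half.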

Unboxing Gemma: High-Throughput Sparse Autoencoder (SAE) Training on Google Cloud TPU v5e

The Curse of Superposition: Why LLMs are Black Boxes

Large Language Models like Google's Gemma are incredibly powerful, but they suffer from a phenomenon known as Superposition. Neural networks naturally want to represent more concepts than they have mathematical dimensions. To accomplish this, they pack multiple unrelated concepts into the same neurons, a property called polysemanticity. When Gemma processes the word "Paris," it doesn't activate a neat, dedicated "City" neuron. Instead, it fires a dense, entangled vector of floating-point numbers in a 2,304-dimensional space that simultaneously represents "France," "capital," "tourism," and "linguistics."

For researchers trying to build safer, steerable AI, this is a massive problem. How do we debug an AI if its internal thoughts are entangled in a dense manifold? The state-of-the-art solution is Mechanistic Interpretability via Sparse Autoencoders (SAEs...

The Gemma 4 E2B Fine-Tuning Cookbook

A complete, opinionated recipe for adapting Gemma 4 E2B to your domain, from multimodal dataset construction and QLoRA configuration through training loop debugging, evaluation, and production deployment.

April 2026 · ~28 min read · HuggingFace / TRL / PEFT

Recipe at a Glance

Serves: 1 fine-tuned model

Ingredients (hardware)
- NVIDIA GPU: ≥ 24 GB VRAM
- System RAM: ≥ 32 GB
- Storage (SSD): ≥ 50 GB free
- Training dataset: 1K–100K samples
- Python: ≥ 3.11
- CUDA: ≥ 12.1

Ingredients (libraries)
- transformers: ≥ 4.51
- peft: ≥ 0.12
- trl: ≥ 0.12
- bitsandbytes: ≥ 0.44
- accelerate: ≥ 0.34
- wandb / mlflow: any

Contents
01. Why Fine-Tune E2B?
02. Dataset Construction
03. QLoRA Configuration
04. Multimodal Fine-Tuning
05. The Training Loop
06. Debugging ...

Building a Multimodal Document Intelligence Pipeline with Gemma 4 E2B

A production-grade walkthrough (PDF ingestion, OCR, audio transcription, structured reasoning, and agentic tool calling), all on a 2.3B-parameter model that fits on a single consumer GPU.

Gemma 4 E2B · Vision · OCR · Audio · ASR · Function Calling · 128K Context

April 2026 · ~22 min read · Google DeepMind / HuggingFace Transformers

Contents
- Model Overview & Key Architecture Choices
- Environment Setup & Loading the Model
- The Document Intelligence Pipeline
- Vision Pipeline: OCR & Document Parsing
- Audio Pipeline: ASR & Translation
- Configurable Thinking Mode
- Agentic Tool Calling
- Production Serving with vLLM
- Benchmark Results

01. Model Overview

Why Gemma 4 E2B Changes On-Device AI

The Gemma 4 family is a fundamental rethinking of what a small model can do. The E2B variant ("2B effective parameters") achieves 2.3B active parameters while c...