
Posts

The Gemma 4 E2B Fine-Tuning Cookbook

A complete, opinionated recipe for adapting Gemma 4 E2B to your domain — from multimodal dataset construction and QLoRA configuration through training-loop debugging, evaluation, and production deployment.

April 2026 · ~28 min read · HuggingFace / TRL / PEFT

Recipe at a Glance — Serves: 1 fine-tuned model

Ingredients (hardware):
- NVIDIA GPU: ≥ 24 GB VRAM
- System RAM: ≥ 32 GB
- Storage (SSD): ≥ 50 GB free
- Training dataset: 1K–100K samples
- Python ≥ 3.11, CUDA ≥ 12.1

Ingredients (libraries):
- transformers ≥ 4.51
- peft ≥ 0.12
- trl ≥ 0.12
- bitsandbytes ≥ 0.44
- accelerate ≥ 0.34
- wandb / mlflow (any)

Contents: 01. Why Fine-Tune E2B? · 02. Dataset Construction · 03. QLoRA Configuration · 04. Multimodal Fine-Tuning · 05. The Training Loop · 06. Debugging ...
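The QLoRA setup the recipe names — 4-bit quantized base weights plus trainable low-rank adapters — typically boils down to two config objects from the listed libraries. A minimal sketch: the hyperparameter values below are common starting points, not the article's exact settings, and the target module names assume a Gemma-style attention layout.

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# 4-bit NF4 quantization for the frozen base model (the "Q" in QLoRA).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Low-rank adapters on the attention projections; only these train.
lora_config = LoraConfig(
    r=16,                      # adapter rank (illustrative default)
    lora_alpha=32,             # scaling factor, often 2x the rank
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
```

Both objects are then passed to `from_pretrained` and `get_peft_model` (or to TRL's `SFTTrainer`) in the usual PEFT workflow.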
Recent posts

Building a Multimodal Document Intelligence Pipeline with Gemma 4 E2B

A production-grade walkthrough — PDF ingestion, OCR, audio transcription, structured reasoning, and agentic tool calling — all on a 2.3B-parameter model that fits on a single consumer GPU.

Tags: Gemma 4 E2B · Vision / OCR · Audio / ASR · Function Calling · 128K Context

April 2026 · ~22 min read · Google DeepMind / HuggingFace Transformers

Contents:
- Model Overview & Key Architecture Choices
- Environment Setup & Loading the Model
- The Document Intelligence Pipeline
- Vision Pipeline — OCR & Document Parsing
- Audio Pipeline — ASR & Translation
- Configurable Thinking Mode
- Agentic Tool Calling
- Production Serving with vLLM
- Benchmark Results

01 — Model Overview: Why Gemma 4 E2B Changes On-Device AI

The Gemma 4 family is a fundamental rethinking of what a small model can do. The E2B variant — "2B effective parameters" — achieves 2.3B active parameters while c...
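The pipeline this post describes — route each document to a modality-specific stage (OCR for PDFs, ASR for audio), then feed the extracted text into a shared reasoning step — can be sketched as a small dispatcher. Every function name here is an illustrative placeholder for where the model's vision, audio, and reasoning calls would go, not the post's actual API.

```python
from pathlib import Path

# Placeholder modality handlers -- in the real pipeline these would call
# the model's vision (OCR) and audio (ASR) capabilities respectively.
def ocr_pdf(path: Path) -> str:
    return f"[text extracted from {path.name}]"

def transcribe_audio(path: Path) -> str:
    return f"[transcript of {path.name}]"

def reason(text: str) -> str:
    # Shared reasoning step, e.g. structured extraction with the LLM.
    return f"summary({text})"

# Route inputs by file type to the right modality stage.
HANDLERS = {
    ".pdf": ocr_pdf,
    ".wav": transcribe_audio,
    ".mp3": transcribe_audio,
}

def process(path: Path) -> str:
    handler = HANDLERS.get(path.suffix.lower())
    if handler is None:
        raise ValueError(f"unsupported file type: {path.suffix}")
    return reason(handler(path))
```

The design point is that the reasoning stage is modality-agnostic: once OCR or ASR has produced text, everything downstream (thinking mode, tool calling, serving) is the same code path.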

Write Once, Scale Everywhere

End-to-End Gemma 2B LoRA Fine-Tuning and Serving on GPU & TPU

If you have ever prototyped a Large Language Model (LLM) on your local GPU and then spent days rewriting your code to scale it on a Google Cloud TPU, you know the pain of hardware lock-in. For the Google TPU Sprint, I wanted to build a solution to this exact problem. This project provides a lightweight, end-to-end pipeline for fine-tuning Google's Gemma 2B model using LoRA (Low-Rank Adaptation) and serving it via a custom REST API. By leveraging KerasNLP and the JAX backend, we can write our training and inference code once and execute it natively on both local NVIDIA GPUs (like the RTX 6000) and Google Cloud TPUs.

⚡ Why the Keras 3 + JAX Stack?

Keras 3 was rewritten to act as a "super-connector" that can run on top of PyTorch, TensorFlow, or JAX without changing the code. By explicitly setting our backend to JAX (os.environ["KERAS_BACKEND"] = "jax")...
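The "write once" trick is that the backend is chosen by an environment variable set before Keras is ever imported. A minimal setup sketch — the preset name follows KerasNLP's published Gemma presets, and the LoRA rank is an example value, not necessarily the project's setting:

```python
import os

# Must be set BEFORE importing keras / keras_nlp, or the default
# backend is locked in for the process.
os.environ["KERAS_BACKEND"] = "jax"

import keras_nlp  # requires `pip install keras-nlp` and Gemma weights access

# Load Gemma 2B and attach LoRA adapters to the backbone; only the
# low-rank adapter weights remain trainable.
model = keras_nlp.models.GemmaCausalLM.from_preset("gemma_2b_en")
model.backbone.enable_lora(rank=4)

# The same script now runs unchanged on a local GPU or a Cloud TPU;
# JAX discovers and targets whatever accelerator is available.
```

Because device placement is delegated entirely to JAX, the training and serving code that follows needs no `if gpu / if tpu` branches.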

Visualizing the Invisible: How Nano Banana 2 Turns Dense Science into Stunning Art

Disclaimer: As a Google Developer Expert (GDE), I was incredibly fortunate to be invited by Google DeepMind to test these models internally before their public release. The capabilities I'm sharing today are based on my hands-on early access.

Have you ever stared at a dense, 15-page academic paper and wished you could just see what the researchers were talking about? As someone who frequently reads and writes heavy technical research, I face this constantly.

Today, Google is introducing Nano Banana 2 (Gemini 3.1 Flash Image). It is the latest state-of-the-art image model, and it is here to completely change how we interact with complex information. By bringing advanced world knowledge and reasoning to the high-speed Flash lineup, Nano Banana 2 dramatically closes the gap between lightning-fast generation speed and breathtaking visual fidelity.

To put this to the test, I took two of my own highly technical research papers, uploaded the PDFs directly into the work...