
My Partner's MRI Didn't Come with a Manual, So I Built One with AI

Gemma 4 Good Hackathon · Impact Track · Health & Sciences

It started with two envelopes.


One contained a single sheet of paper, a radiologist's report for my partner. It was a wall of text that might as well have been written in another language. Words like "parenchymal volume," "hyperintensities," and "susceptibility artifact" stared back at us, creating more anxiety than they resolved.

The other was a flimsy paper sleeve containing a CD-ROM. This, we were told, held the actual images from her MRI scan. The ground truth. And we couldn't even look at it.

Our laptops, like most these days, don't have disc drives. For a moment, this crucial, deeply personal piece of her health information was a coaster. I felt that familiar, hot-wired frustration every engineer knows: the feeling of being locked out by a dumb problem. The powerlessness was infuriating.

So, I did what any slightly obsessive software engineer would do. I went on Amazon and, with a grim sense of purpose, ordered a $20 external DVD drive. It felt like an analog solution to a digital problem, and it was the first step down a rabbit hole that would consume me for months.

▶ Prefer to watch first? 3-minute demo on YouTube

The Firehose and the Flipbook

When the drive arrived, I plugged it in and was greeted not by JPEGs, but by hundreds of .dcm files. I had just stumbled into the world of DICOM, the universal, arcane language of medical imaging.

After finding a free viewer, I finally saw the "scan." It wasn't one picture. It was a dataset. Over 500 individual slices. Flicking through them felt like watching a grainy, black-and-white flipbook of my partner's brain.

This wasn't clarity. This was a data firehose. How could anyone possibly find a tiny anomaly in this sea of gray?

I realized that a radiologist's superpower isn't just their ability to see. It's their ability to ignore. They use the clinical context to know where to look, which of the dozen-plus series to focus on, and how to tune their eyes. They aren't human search engines; they are detectives following leads.

And I thought, "That's it. That's what an AI needs to do. It needs to be a detective."



Dr.MRI.AI — the project that grew out of that hospital corridor.

What I built

Dr.MRI.AI is a privacy-first DICOM viewer that uses Gemma 4 to choose the evidence before reviewing it. You drag your folder onto the page, type a clinical question in plain English, and the model returns a structured plan — which series, which slice range, which window/level — before it ever looks at a pixel. The plan is shown to you. You can accept it, edit it, or reject it. Only then does the multimodal review run, on the focused subset.

A representative knee MRI: 200+ slices reduced to 16 selected frames before Gemma 4 ever looks at a pixel. ~99% fewer image tokens, entirely on-device.

The final report cites slice labels you can click — "a potential lesion is noted on (Slice 102)" — and clicking jumps the viewer to that exact frame. You're not a passive recipient anymore. You're an active investigator.

Watch it in action

The clearest way to understand the workflow is to see it. The walkthrough below shows the drag-and-drop load, the prompt, the plan preview, the evidence review, and the final clickable report — about three minutes end-to-end.

The 3-minute Dr.MRI.AI walkthrough. Direct link: youtube.com/watch?v=sIceYp5vTQc

The Engineering Underneath the Hood

For my fellow engineers and the terminally curious, this is where the fun begins. Here's how I translated that "detective" concept into code. The entire project is on my GitHub — github.com/rabimba/drmriai — if you want to see the source.

The Body: React, Vite, and a headless Cornerstone3D. The application's skeleton is a modern web stack: React, Vite, and Tailwind CSS. The "eyes" of the operation, the high-performance viewer, are powered by Cornerstone3D, a widely used open-source library for web-based medical imaging. It uses WebGL for GPU-accelerated rendering, which lets you scroll through hundreds of images without a stutter.

The Nervous System: Taming the DICOM firehose. When you drag and drop a folder, a lot happens in a Web Worker so the UI never freezes:

  1. Parsing. I use dicom-parser to rip through each file's header without touching the heavy pixel data. This is incredibly fast.
  2. Extraction. I pull out the clues for the AI detective — SeriesInstanceUID, InstanceNumber, ConvolutionKernel, and the ImageOrientationPatient direction cosines that tell me whether a series is axial, sagittal, or coronal. (Series Description is free text. Direction cosines are math, and math doesn't lie.)
  3. Organization. The code groups all slices by their shared series and sorts them by instance number. The result is a single, clean StudyMetadata object the model can actually reason about.
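The extraction and organization steps can be sketched in a few lines. This is a minimal illustration, not the project's actual code: the `SliceHeader` and `SeriesGroup` shapes and both function names are assumptions, and the headers are taken as already parsed. The orientation check takes the cross product of the row and column direction cosines to get the slice normal, then labels the series by the normal's dominant axis in the DICOM patient coordinate system (x runs left-right, y front-back, z foot-head).

```typescript
type Orientation = "axial" | "sagittal" | "coronal";

// Hypothetical shape for the header fields pulled from each file.
interface SliceHeader {
  seriesInstanceUID: string;
  instanceNumber: number;
  // Row direction cosines (x, y, z) followed by column direction cosines (x, y, z).
  imageOrientationPatient: [number, number, number, number, number, number];
}

interface SeriesGroup {
  seriesInstanceUID: string;
  orientation: Orientation;
  slices: SliceHeader[];
}

function classifyOrientation(
  iop: SliceHeader["imageOrientationPatient"]
): Orientation {
  const [rx, ry, rz, cx, cy, cz] = iop;
  // Slice normal = row vector x column vector.
  const nx = Math.abs(ry * cz - rz * cy);
  const ny = Math.abs(rz * cx - rx * cz);
  const nz = Math.abs(rx * cy - ry * cx);
  if (nz >= nx && nz >= ny) return "axial"; // normal points head-to-foot
  if (nx >= ny) return "sagittal"; // normal points left-to-right
  return "coronal"; // normal points front-to-back
}

function groupBySeries(headers: SliceHeader[]): SeriesGroup[] {
  const bySeries = new Map<string, SliceHeader[]>();
  for (const header of headers) {
    const bucket = bySeries.get(header.seriesInstanceUID);
    if (bucket) bucket.push(header);
    else bySeries.set(header.seriesInstanceUID, [header]);
  }
  const groups: SeriesGroup[] = [];
  bySeries.forEach((slices, seriesInstanceUID) => {
    // Sort each series by instance number, as described in step 3.
    slices.sort((a, b) => a.instanceNumber - b.instanceNumber);
    groups.push({
      seriesInstanceUID,
      // Assumption: every slice in a series shares one orientation,
      // so the first slice is representative.
      orientation: classifyOrientation(slices[0].imageOrientationPatient),
      slices,
    });
  });
  return groups;
}
```

Oblique acquisitions fall into whichever plane they are closest to, which is usually what a planner needs.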



Three Gemma 4 calls, one cached model. The plan from Call 1 is shown to the user before any image is sent to Call 2.

The Brain: A Two-Call AI Architecture (Really, Three)

This is the heart of the project. I split the AI task into a planner, a reviewer, and a synthesizer, all powered by the same Gemma 4 model.

The Brain: three calls, one cached Gemma 4 model, with the SliceExporter pipeline that turns raw DICOM into model-ready JPEGs.



Call 1: The Planner (text-only). The mission is to create a plan of attack. I send the model a clinical summary built from StudyMetadata and instruct it to return a structured JSON SelectionPlan. No pixels, no waste. This is where a specialized model like MedGemma shines: fine-tuned on medical text, it understands the jargon natively and picks better series.
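Because the plan comes back as model-generated JSON, it is worth validating before anything acts on it. The sketch below shows one way to do that. The field names in `SelectionPlan` are assumptions based on the description above (which series, which slice range, which window/level), not the project's real schema, and slice numbers are assumed to be 1-based within a series.

```typescript
// Hypothetical plan shape: one series, one slice range, one window/level.
interface SelectionPlan {
  seriesInstanceUID: string;
  sliceStart: number;
  sliceEnd: number;
  windowCenter: number;
  windowWidth: number;
}

function requireNumber(obj: Record<string, unknown>, key: string): number {
  const value = obj[key];
  if (typeof value !== "number" || !Number.isFinite(value)) {
    throw new Error(`${key} must be a finite number`);
  }
  return value;
}

// sliceCounts maps each known SeriesInstanceUID to its number of slices.
function parseSelectionPlan(
  raw: string,
  sliceCounts: Map<string, number>
): SelectionPlan {
  const data: unknown = JSON.parse(raw);
  if (typeof data !== "object" || data === null || Array.isArray(data)) {
    throw new Error("plan must be a JSON object");
  }
  const obj = data as Record<string, unknown>;

  const uid = obj["seriesInstanceUID"];
  if (typeof uid !== "string") {
    throw new Error("seriesInstanceUID must be a string");
  }
  const count = sliceCounts.get(uid);
  if (count === undefined) {
    throw new Error(`plan references an unknown series: ${uid}`);
  }

  const sliceStart = requireNumber(obj, "sliceStart");
  const sliceEnd = requireNumber(obj, "sliceEnd");
  const windowCenter = requireNumber(obj, "windowCenter");
  const windowWidth = requireNumber(obj, "windowWidth");

  const rangeOk =
    Number.isInteger(sliceStart) &&
    Number.isInteger(sliceEnd) &&
    sliceStart >= 1 &&
    sliceEnd <= count &&
    sliceStart <= sliceEnd;
  if (!rangeOk) {
    throw new Error(`slice range must lie within 1..${count}`);
  }
  if (windowWidth <= 0) {
    throw new Error("windowWidth must be positive");
  }
  return { seriesInstanceUID: uid, sliceStart, sliceEnd, windowCenter, windowWidth };
}
```

Rejecting a plan that names a series or slice outside the loaded study keeps a hallucinated identifier from reaching the export step.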

Call 2: The Reviewer (multimodal vision). My SliceExporter module now executes the plan. This is a delicate process: it fetches the raw 16-bit pixel data, uses an HTML5 Canvas to apply the correct window/level values (critical for visibility), and encodes a viewable 8-bit JPEG, perfectly prepared for the vision model. Gemma 4 then reviews those JPEGs in small batches with slice-position context.
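The window/level step is where 16-bit intensities become something a JPEG encoder (and a vision model) can use. Here is a minimal sketch of that mapping as a pure function. The function name is hypothetical, it uses a simplified linear window rather than the exact DICOM formula, and it assumes any rescale slope/intercept has already been applied.

```typescript
// Map raw intensities to 8-bit grayscale using a linear window.
// Values below (center - width / 2) become 0; values above
// (center + width / 2) become 255.
function applyWindowLevel(
  pixels: ArrayLike<number>,
  windowCenter: number,
  windowWidth: number
): Uint8ClampedArray {
  if (windowWidth <= 0) {
    throw new RangeError("windowWidth must be positive");
  }
  const lower = windowCenter - windowWidth / 2;
  const out = new Uint8ClampedArray(pixels.length);
  for (let i = 0; i < pixels.length; i++) {
    // Uint8ClampedArray rounds and clamps to 0..255 on assignment.
    out[i] = ((pixels[i] - lower) / windowWidth) * 255;
  }
  return out;
}

// Example: a window centered at 100 with width 200 covers 0..200.
const gray = applyWindowLevel(new Int16Array([-50, 0, 50, 200, 1000]), 100, 200);
// gray is [0, 0, 64, 255, 255]
```

In a browser, the grayscale bytes would then be expanded to RGBA, written into a Canvas, and encoded to JPEG.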

Call 3: The Synthesizer (text-only). A final pass merges the batch notes into one cohesive report, with the explicit instruction that every finding must reference a slice label from the batch notes. No invention, no drift.
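A prompt instruction is a request, not a guarantee, so the "no invention" rule can also be checked after the fact. The sketch below pulls every "(Slice N)" citation out of a report and flags any that were never among the reviewed slices. The function names are hypothetical, and it assumes the report uses the "(Slice 102)" label format quoted earlier.

```typescript
// Collect every slice number cited as "(Slice N)" in the report text.
function extractSliceCitations(report: string): number[] {
  const pattern = /\(Slice (\d+)\)/g;
  const cited: number[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(report)) !== null) {
    cited.push(Number(match[1]));
  }
  return cited;
}

// Return cited slice numbers that were never sent to the reviewer.
function findUngroundedCitations(
  report: string,
  reviewedSlices: number[]
): number[] {
  const allowed = new Set(reviewedSlices);
  return extractSliceCitations(report).filter((n) => !allowed.has(n));
}

const report =
  "A potential lesion is noted on (Slice 102). Compare with (Slice 340).";
const ungrounded = findUngroundedCitations(report, [100, 101, 102, 103]);
// ungrounded is [340]: the model cited a slice it never saw.
```

The same extraction is what a viewer would need to turn each citation into a clickable jump target.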

The haystack on the left. The actual evidence on the right. ~99% fewer image tokens reach the model.

The Pluggable Brain: One Interface, Three Ways to Run Gemma 4

To avoid being locked into one AI provider, I used a classic software design pattern. I defined a simple LLMService interface — three methods, getSelectionPlan, analyzeSlices, sendFollowUp — and wrote three implementations against the same Gemma 4 family of models. Same JSON contract, same multimodal batch format, same Evidence ZIP export. Pick the deployment story that fits your constraints.
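To make the pattern concrete, here is a minimal sketch of what such an interface could look like. Only the three method names come from the description above. The parameter lists, return types, `BackendKind` labels, and the stub factory are assumptions for illustration, not the project's real signatures.

```typescript
// Labels for the three deployment paths (names are illustrative).
type BackendKind = "browser-webgpu" | "ollama-local" | "openai-compatible";

interface LLMService {
  readonly kind: BackendKind;
  // Call 1: text-only planning; resolves to the raw JSON plan.
  getSelectionPlan(clinicalSummary: string): Promise<string>;
  // Call 2: multimodal review of one batch of exported JPEGs.
  analyzeSlices(jpegDataUrls: string[], sliceContext: string): Promise<string>;
  // Follow-up questions about the finished report.
  sendFollowUp(question: string): Promise<string>;
}

// Stand-in implementation so callers can be exercised without a model.
function createStubService(kind: BackendKind): LLMService {
  return {
    kind,
    getSelectionPlan: async (summary) => `[${kind}] plan for: ${summary}`,
    analyzeSlices: async (urls, context) =>
      `[${kind}] reviewed ${urls.length} slices (${context})`,
    sendFollowUp: async (question) => `[${kind}] answer to: ${question}`,
  };
}

// The calling code depends only on the interface, never on a backend.
async function planThenReview(
  service: LLMService,
  summary: string,
  jpegDataUrls: string[]
): Promise<string[]> {
  const plan = await service.getSelectionPlan(summary);
  const notes = await service.analyzeSlices(jpegDataUrls, "batch 1");
  return [plan, notes];
}
```

Swapping the browser backend for Ollama or a hosted endpoint then means passing a different object to `planThenReview`, with no change to the caller.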

1. Gemma 4 Browser (WebGPU)

onnx-community/gemma-4-E2B-it-ONNX via Transformers.js at q4f16. The whole pipeline runs in your browser tab on WebGPU. No server, no API key, no upload. This is the public demo path and what you see in the video.

2. Ollama Local (MedGemma + Gemma 4)

Planner: alibayram/medgemma:4b. Reviewer: gemma4:latest. MedGemma handles the medical-text planning step; multimodal Gemma 4 handles the vision review. Both local, on your machine, via Ollama.

3. OpenAI-Compatible Endpoint

Point Dr.MRI.AI at any /v1 chat completions URL — Google's Gemini OpenAI-compat endpoint, a self-hosted Gemma 4 on vLLM/TGI, Together/Fireworks/Groq, or an internal hospital gateway. Bring your own key, entered at runtime.
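For the endpoint path, the multimodal batch travels as an ordinary chat completions request. The sketch below builds one using the widely used OpenAI-style message format, with each JPEG passed as a base64 data URL. The function name, the `ChatRequest` shape, and the placeholder URL and model name are assumptions, and whether a given provider accepts image parts depends on the model it serves.

```typescript
interface ChatRequest {
  url: string;
  init: { method: "POST"; headers: Record<string, string>; body: string };
}

function buildReviewRequest(
  baseUrl: string, // e.g. "https://example.invalid/v1" (placeholder)
  apiKey: string, // supplied by the user at runtime
  model: string,
  prompt: string,
  jpegBase64: string[]
): ChatRequest {
  // One text part followed by one image part per exported slice.
  const content = [
    { type: "text", text: prompt },
    ...jpegBase64.map((b64) => ({
      type: "image_url",
      image_url: { url: `data:image/jpeg;base64,${b64}` },
    })),
  ];
  return {
    // Strip trailing slashes so the path is not doubled.
    url: baseUrl.replace(/\/+$/, "") + "/chat/completions",
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${apiKey}`,
      },
      body: JSON.stringify({ model, messages: [{ role: "user", content }] }),
    },
  };
}
```

The result can be handed straight to `fetch(request.url, request.init)`, and keeping the key as a runtime argument matches the bring-your-own-key approach.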

That portability is the point. Medical imaging shouldn't be locked to one vendor's hosting story — some users need a fully local deployment, others need an enterprise gateway, and the public demo needs to work for a stranger on the internet with nothing but a browser.

Why Gemma 4 Specifically

Three properties of the Gemma 4 family made this architecture viable.

It's small enough for the browser. Gemma 4 E2B at q4f16 quantization fits in a single-digit-GB cache. The first load on a modest laptop takes a couple of minutes; after that it's cache-hot. This is the model size that makes "frontier intelligence at the edge" stop being a slogan.

One model serves the planner, the reviewer, and the synthesizer. Gemma 4 is multimodal but also a strong text reasoner. I use the same cached weight file for all three calls — text JSON planning, multimodal review, text synthesis. No model swap, no second context window, no cross-model orchestration overhead.

It's an open family with a medical vertical. Because Gemma is open, a self-hosted OpenAI-compatible deployment is reasonable, a fully local Ollama deployment is reasonable, and MedGemma already exists as a domain-tuned planner. Three reproducible deployment paths fall out of one open family.



The three Gemma 4 properties that made the architecture possible.

The "Owner's Manual" I Wish We Had

The result of this obsessive journey is the tool I was desperately searching for on that first day. A tool that respects the complexity of the data and the privacy of the user.

You can try it yourself at rabimba.github.io/drmriai/.

You drag your DICOM folder onto the page, ask your question, and watch as the AI thinks, plans, and analyzes. Then, the report appears. And when it says "a potential lesion is noted on (Slice 102)," you click that blue link, and the main viewer instantly snaps to that exact slice. You're no longer a passive recipient of information. You're an active investigator.

This project was born from frustration, but it was built with a sense of purpose. It's not about replacing doctors; it's about enabling patients to have more informed, confident conversations with their doctors.

It's about turning that locked box of data into something you can finally see, question, and understand. It's the owner's manual we all deserve.

Try it: rabimba.github.io/drmriai/  ·  Source: github.com/rabimba/drmriai  ·  Video: 3-min demo

Educational and research use only. Dr.MRI.AI is not a certified medical device and is not intended for clinical diagnosis or treatment decisions. Always consult a qualified physician for medical advice.
