It started with two envelopes.
One contained a single sheet of paper, a radiologist's report for my partner. It was a wall of text that might as well have been written in another language. Words like "parenchymal volume," "hyperintensities," and "susceptibility artifact" stared back at us, creating more anxiety than they resolved.
The other was a flimsy paper sleeve containing a CD-ROM. This, we were told, held the actual images from her MRI scan. The ground truth. And we couldn't even look at it.
Our laptops, like most these days, don't have disc drives. For a moment, this crucial, deeply personal piece of her health information was a coaster. I felt that familiar, hot-wired frustration every engineer knows: the feeling of being locked out by a dumb problem. The powerlessness was infuriating.
So, I did what any slightly obsessive software engineer would do. I went on Amazon and, with a grim sense of purpose, ordered a $20 external DVD drive. It felt like an analog solution to a digital problem, and it was the first step down a rabbit hole that would consume me for months.
The Firehose and the Flipbook
When the drive arrived, I plugged it in and was greeted not by JPEGs, but by hundreds of .dcm files. I had just stumbled into the world of DICOM, the universal, arcane language of medical imaging.
After finding a free viewer, I finally saw the "scan." It wasn't one picture. It was a dataset. Over 500 individual slices. Flicking through them felt like watching a grainy, black-and-white flipbook of my partner's brain.
This wasn't clarity. This was a data firehose. How could anyone possibly find a tiny anomaly in this sea of gray?
I realized that a radiologist's superpower isn't just their ability to see. It's their ability to ignore. They use the clinical context to know where to look, which of the dozen-plus series to focus on, and how to tune their eyes. They aren't human search engines; they are detectives following leads.
And I thought, "That's it. That's what an AI needs to do. It needs to be a detective."
Dr.MRI.AI — the project that grew out of that hospital corridor.
What I built
Dr.MRI.AI is a privacy-first DICOM viewer that uses Gemma 4 to choose the evidence before reviewing it. You drag your folder onto the page, type a clinical question in plain English, and the model returns a structured plan — which series, which slice range, which window/level — before it ever looks at a pixel. The plan is shown to you. You can accept it, edit it, or reject it. Only then does the multimodal review run, on the focused subset.
The final report cites slice labels you can click — "a potential lesion is noted on (Slice 102)" — and clicking jumps the viewer to that exact frame. You're not a passive recipient anymore. You're an active investigator.
Watch it in action
The clearest way to understand the workflow is to see it. The walkthrough below shows the drag-and-drop load, the prompt, the plan preview, the evidence review, and the final clickable report — about three minutes end-to-end.
The 3-minute Dr.MRI.AI walkthrough. Direct link: youtube.com/watch?v=sIceYp5vTQc
The Engineering Underneath the Hood
For my fellow engineers and the terminally curious, this is where the fun begins. Here's how I translated that "detective" concept into code. The entire project is on my GitHub — github.com/rabimba/drmriai — if you want to see the source.
The Body: React, Vite, and a headless Cornerstone3D. The application's skeleton is a modern web stack — React, Vite, and Tailwind CSS. The "eyes" of the operation, the high-performance viewer, are powered by Cornerstone3D, the gold standard for web-based medical imaging. It uses WebGL for GPU-accelerated rendering that lets you scroll through hundreds of images without a stutter.
The Nervous System: Taming the DICOM firehose. When you drag and drop a folder, a lot happens in a Web Worker so the UI never freezes:
- Parsing. I use dicom-parser to rip through each file's header without touching the heavy pixel data. This is incredibly fast.
- Extraction. I pull out the clues for the AI detective: SeriesInstanceUID, InstanceNumber, ConvolutionKernel, and the ImageOrientationPatient direction cosines that tell me whether a series is axial, sagittal, or coronal. (Series Description is free text. Direction cosines are math, and math doesn't lie.)
- Organization. The code groups all slices by their shared series and sorts them by instance number. The result is a single, clean StudyMetadata object the model can actually reason about.
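For the curious, the extraction and organization steps can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the SliceHeader shape and the function names classifyPlane and groupBySeries are hypothetical, and it assumes the header values have already been read out of each file.

```typescript
// Hypothetical shapes and names for illustration; not taken from the project source.
type Plane = "axial" | "sagittal" | "coronal";

interface SliceHeader {
  seriesInstanceUID: string;
  instanceNumber: number;
  // ImageOrientationPatient: row direction cosines, then column direction cosines.
  imageOrientationPatient: [number, number, number, number, number, number];
}

// The slice normal is the cross product of the row and column vectors.
// In the DICOM patient coordinate system, x runs toward the patient's left,
// y toward the back, and z toward the head, so the dominant axis of the
// normal names the plane.
function classifyPlane(iop: SliceHeader["imageOrientationPatient"]): Plane {
  const [rx, ry, rz, cx, cy, cz] = iop;
  const normal = [
    Math.abs(ry * cz - rz * cy), // x component: sagittal
    Math.abs(rz * cx - rx * cz), // y component: coronal
    Math.abs(rx * cy - ry * cx), // z component: axial
  ];
  const dominant = normal.indexOf(Math.max(...normal));
  return (["sagittal", "coronal", "axial"] as const)[dominant];
}

// Group slices by series, then sort each series by instance number.
function groupBySeries(headers: SliceHeader[]): Map<string, SliceHeader[]> {
  const series = new Map<string, SliceHeader[]>();
  for (const header of headers) {
    const slices = series.get(header.seriesInstanceUID) ?? [];
    slices.push(header);
    series.set(header.seriesInstanceUID, slices);
  }
  series.forEach((slices) => {
    slices.sort((a, b) => a.instanceNumber - b.instanceNumber);
  });
  return series;
}
```

Oblique acquisitions still get a label this way, because the sketch picks the largest component of the normal rather than requiring an exact match.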
Three Gemma 4 calls, one cached model. The plan from Call 1 is shown to the user before any image is sent to Call 2.
The Brain: A Two-Call AI Architecture (Really, Three)
This is the heart of the project. I split the AI task into a planner, a reviewer, and a synthesizer, all powered by the same Gemma 4 model.
The Brain: three calls, one cached Gemma 4 model, with the SliceExporter pipeline that turns raw DICOM into model-ready JPEGs.
Call 1: The Planner (text-only). The mission is to create a plan of attack. I send the model a clinical summary built from StudyMetadata and instruct it to return a structured JSON SelectionPlan. No pixels, no waste. This is where a specialized model like MedGemma shines: fine-tuned on medical text, it understands the jargon natively and tends to pick more relevant series.
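To make "structured JSON" concrete, here is one way a plan could be validated before it is shown to the user. Treat it as a sketch: the field names on SelectionPlan and the parseSelectionPlan helper are assumptions for illustration, not the project's real schema. The point is that model output is parsed, checked against the known series, and clamped to the slices that actually exist.

```typescript
// Hypothetical schema: series, slice range, window/level, as described in the post.
interface SelectionPlan {
  seriesInstanceUID: string;
  sliceStart: number;
  sliceEnd: number;
  windowCenter: number;
  windowWidth: number;
}

// sliceCounts maps each known SeriesInstanceUID to its number of slices.
function parseSelectionPlan(
  raw: string,
  sliceCounts: Record<string, number>
): SelectionPlan {
  const data: unknown = JSON.parse(raw);
  if (typeof data !== "object" || data === null) {
    throw new Error("plan is not a JSON object");
  }
  const fields = data as Record<string, unknown>;

  const uid = fields.seriesInstanceUID;
  if (typeof uid !== "string" || !Object.prototype.hasOwnProperty.call(sliceCounts, uid)) {
    throw new Error("plan references an unknown series");
  }

  const readNumber = (key: string): number => {
    const value = fields[key];
    if (typeof value !== "number" || !isFinite(value)) {
      throw new Error(`${key} must be a finite number`);
    }
    return value;
  };

  // Clamp the requested range to slices that exist (1-based, inclusive).
  const count = sliceCounts[uid];
  const sliceStart = Math.max(1, Math.min(Math.round(readNumber("sliceStart")), count));
  const sliceEnd = Math.max(sliceStart, Math.min(Math.round(readNumber("sliceEnd")), count));

  const windowWidth = readNumber("windowWidth");
  if (windowWidth <= 0) throw new Error("windowWidth must be positive");

  return {
    seriesInstanceUID: uid,
    sliceStart,
    sliceEnd,
    windowCenter: readNumber("windowCenter"),
    windowWidth,
  };
}
```

Rejecting an unknown series outright, rather than guessing a substitute, keeps the plan preview honest: the user only ever sees a plan that maps onto real data.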
Call 2: The Reviewer (multimodal vision). My SliceExporter module now executes the plan. This is a delicate process: it fetches the raw 16-bit pixel data, uses an HTML5 Canvas to apply the correct window/level values (critical for visibility), and encodes a viewable 8-bit JPEG for the vision model. Gemma 4 then reviews those JPEGs in small batches with slice-position context.
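The window/level step is the part that is easiest to get wrong, so here is the arithmetic on its own, without the Canvas or JPEG encoding. This is a simplified linear mapping under stated assumptions: applyWindowLevel is a hypothetical name, the input is assumed to be already rescaled stored values, and the exact DICOM VOI LUT formula differs slightly at the edges.

```typescript
// Map 16-bit pixel values to 8-bit grayscale using a window center and width.
// Values below the window become black, values above it become white, and
// values inside it are spread linearly across 0..255.
function applyWindowLevel(
  pixels: Int16Array,
  center: number,
  width: number
): Uint8ClampedArray {
  if (width <= 0) throw new Error("window width must be positive");
  const low = center - width / 2;
  // Uint8ClampedArray clamps out-of-range results to 0..255 on assignment.
  const out = new Uint8ClampedArray(pixels.length);
  for (let i = 0; i < pixels.length; i++) {
    out[i] = ((pixels[i] - low) / width) * 255;
  }
  return out;
}
```

A narrow window stretches a small range of intensities across the full grayscale, which is why the same slice can look washed out with one setting and reveal subtle contrast with another.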
Call 3: The Synthesizer (text-only). A final pass merges the batch notes into one cohesive report, with the explicit instruction that every finding must reference a slice label from the batch notes. No invention, no drift.
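The "no invention" rule is a prompt instruction, but it can also be checked after the fact. The sketch below shows one possible check, assuming citations use the "(Slice 102)" format from the report. The helper names extractSliceLabels and findUnsupportedCitations are hypothetical; whether the project performs a check like this is not something I'm claiming here.

```typescript
// Pull every "(Slice N)" label out of a piece of text, in order of appearance.
function extractSliceLabels(text: string): number[] {
  const pattern = /\(Slice (\d+)\)/g;
  const labels: number[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(text)) !== null) {
    labels.push(Number(match[1]));
  }
  return labels;
}

// Return the labels cited in the final report that never appeared in any
// batch note. An empty result means every citation is grounded.
function findUnsupportedCitations(report: string, batchNotes: string[]): number[] {
  const known = new Set<number>();
  for (const note of batchNotes) {
    for (const label of extractSliceLabels(note)) {
      known.add(label);
    }
  }
  return extractSliceLabels(report).filter((label) => !known.has(label));
}
```

The same extraction doubles as the hook for the clickable report: each label found is a frame the viewer can jump to.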
The haystack on the left. The actual evidence on the right. ~99% fewer image tokens reach the model.
The Pluggable Brain: One Interface, Three Ways to Run Gemma 4
To avoid being locked into one AI provider, I used a classic software design pattern. I defined a simple LLMService interface — three methods, getSelectionPlan, analyzeSlices, sendFollowUp — and wrote three implementations against the same Gemma 4 family of models. Same JSON contract, same multimodal batch format, same Evidence ZIP export. Pick the deployment story that fits your constraints.
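In TypeScript terms, the pattern looks something like this. The three method names come from the project; everything else (the parameter lists, the Backend type, StubService, and createService) is a hypothetical sketch to show the shape of the idea, with a stub standing in for the real backends.

```typescript
// Hypothetical backend identifiers for the three deployment paths.
type Backend = "browser-webgpu" | "ollama-local" | "openai-compatible";

// The method names match the post; the signatures are assumptions.
interface LLMService {
  readonly backend: Backend;
  getSelectionPlan(clinicalQuestion: string, studySummary: string): Promise<string>;
  analyzeSlices(jpegsBase64: string[], sliceContext: string): Promise<string>;
  sendFollowUp(question: string): Promise<string>;
}

// A stand-in implementation: a real one would call Transformers.js, Ollama,
// or an OpenAI-compatible endpoint behind the same three methods.
class StubService implements LLMService {
  readonly backend: Backend;

  constructor(backend: Backend) {
    this.backend = backend;
  }

  async getSelectionPlan(clinicalQuestion: string, _studySummary: string): Promise<string> {
    return `plan for: ${clinicalQuestion}`;
  }

  async analyzeSlices(jpegsBase64: string[], _sliceContext: string): Promise<string> {
    return `reviewed ${jpegsBase64.length} slices`;
  }

  async sendFollowUp(question: string): Promise<string> {
    return `follow-up: ${question}`;
  }
}

// The rest of the app depends only on LLMService, never on a concrete backend.
function createService(backend: Backend): LLMService {
  return new StubService(backend);
}
```

Because callers only see the interface, swapping the deployment path is a one-line change at the factory rather than a rewrite of the pipeline.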
1. Gemma 4 Browser (WebGPU)
onnx-community/gemma-4-E2B-it-ONNX via Transformers.js at q4f16. The whole pipeline runs in your browser tab on WebGPU. No server, no API key, no upload. This is the public demo path and what you see in the video.
2. Ollama Local (MedGemma + Gemma 4)
Planner: alibayram/medgemma:4b. Reviewer: gemma4:latest. MedGemma handles the medical-text planning step; multimodal Gemma 4 handles the vision review. Both local, on your machine, via Ollama.
3. OpenAI-Compatible Endpoint
Point Dr.MRI.AI at any /v1 chat completions URL — Google's Gemini OpenAI-compat endpoint, a self-hosted Gemma 4 on vLLM/TGI, Together/Fireworks/Groq, or an internal hospital gateway. Bring your own key, entered at runtime.
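For reference, a multimodal request in the OpenAI chat completions format carries images as image_url content parts, which can hold base64 data URLs. The builder below is a rough sketch of how one review batch might be packaged. The buildReviewRequest name, the system prompt text, and the per-slice "Slice N" labeling are assumptions for illustration, not the project's actual request code.

```typescript
type ContentPart =
  | { type: "text"; text: string }
  | { type: "image_url"; image_url: { url: string } };

interface ChatMessage {
  role: "system" | "user";
  content: string | ContentPart[];
}

interface ChatRequest {
  model: string;
  messages: ChatMessage[];
}

interface ExportedSlice {
  label: number;      // slice label the report can cite
  jpegBase64: string; // window/leveled 8-bit JPEG, base64-encoded
}

// Build the JSON body for POST {baseUrl}/chat/completions.
function buildReviewRequest(
  model: string,
  instructions: string,
  slices: ExportedSlice[]
): ChatRequest {
  const parts: ContentPart[] = [{ type: "text", text: instructions }];
  for (const slice of slices) {
    // Pair each image with its label so findings can reference it.
    parts.push({ type: "text", text: `Slice ${slice.label}` });
    parts.push({
      type: "image_url",
      image_url: { url: `data:image/jpeg;base64,${slice.jpegBase64}` },
    });
  }
  return {
    model,
    messages: [
      { role: "system", content: "Cite a slice label for every finding." },
      { role: "user", content: parts },
    ],
  };
}
```

The body would be sent with an Authorization header carrying the user's runtime-entered key. Not every OpenAI-compatible server accepts image parts, so this path depends on the endpoint hosting a vision-capable model.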
That portability is the point. Medical imaging shouldn't be locked to one vendor's hosting story — some users need a fully local deployment, others need an enterprise gateway, and the public demo needs to work for a stranger on the internet with nothing but a browser.
Why Gemma 4 Specifically
Three properties of the Gemma 4 family made this architecture viable.
It's small enough for the browser. Gemma 4 E2B at q4f16 quantization fits in a single-digit-GB cache. The first load on a modest laptop takes a couple of minutes; after that it's cache-hot. This is the model size that makes "frontier intelligence at the edge" stop being a slogan.
One model serves the planner, the reviewer, and the synthesizer. Gemma 4 is multimodal but also a strong text reasoner. I use the same cached weight file for all three calls — text JSON planning, multimodal review, text synthesis. No model swap, no second context window, no cross-model orchestration overhead.
It's an open family with a medical vertical. Because Gemma is open, a self-hosted OpenAI-compatible deployment is reasonable, a fully local Ollama deployment is reasonable, and MedGemma already exists as a domain-tuned planner. Three reproducible deployment paths fall out of one open family.
The three Gemma 4 properties that made the architecture possible.
The "Owner's Manual" I Wish We Had
The result of this obsessive journey is the tool I was desperately searching for on that first day. A tool that respects the complexity of the data and the privacy of the user.
You can try it yourself at rabimba.github.io/drmriai/.
You drag your DICOM folder on, ask your question, and watch as the AI thinks, plans, and analyzes. Then, the report appears. And when it says "a potential lesion is noted on (Slice 102)," you click that blue link, and the main viewer instantly snaps to that exact slice. You're no longer a passive recipient of information. You're an active investigator.
This project was born from frustration, but it was built with a sense of purpose. It's not about replacing doctors; it's about enabling patients to have more informed, confident conversations with their doctors.
It's about turning that locked box of data into something you can finally see, question, and understand. It's the owner's manual we all deserve.
Try it: rabimba.github.io/drmriai/ · Source: github.com/rabimba/drmriai · Video: 3-min demo