"AI in a Jupyter notebook is safe. AI on a race track—at 150 mph—is a different story." - Ajeet Mirwani
When we accepted the High-Velocity AI Field Test, the challenge was clear: Build an AI system that could coach a driver in real-time. But in motorsport, "real-time" doesn't mean a fast web request (500ms). It means sub-50ms. At 150 mph, a half-second of latency puts you 110 feet further down the track, the difference between hitting the apex and hitting the wall.
We quickly realized we couldn't rely on the cloud for everything. We needed a "Split-Brain" architecture: Gemini Flash 3.0 in the cloud for high-level strategy, and a fine-tuned Gemma at the edge for immediate, reflex-based coaching.
While we haven't put the model in the driver's seat just yet, we have successfully completed the critical first phase: engineering the "brain" capable of understanding high-speed telemetry. Here is the deep technical dive into how we fine-tuned Gemma for the racetrack.
1. Synthesizing the "Golden Lap"
The first hurdle was the data. LLMs speak English; cars speak physics (CAN bus, OBD-II, GPS). We needed to bridge this gap to create a "Trustable" system.
We didn't just ask the model to "guess" the driving line. We used a Human-in-the-Loop approach. We captured telemetry from a pro driver (the "Golden Lap") and synchronized it with novice telemetry.
The Telemetry-to-Token Pipeline
We treated vehicle state as a specialized language. Instead of feeding raw JSON, we quantized continuous telemetry variables into discrete semantic tokens. This reduces the context window and improves the model's pattern matching.
We defined a sliding window of state $S_t$ representing the difference ($\Delta$) between the Novice ($N$) and the Pro ($P$):
$$S_t = \{ \Delta v, \Delta lat, \Delta long, \text{Sector}_{ID} \}$$
Where:
- $\Delta v$: Speed delta ($v_{novice} - v_{pro}$)
- $\Delta lat$: Lateral G-force difference (Cornering intensity)
- $\Delta long$: Longitudinal G-force difference (Braking/Acceleration efficiency)
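A minimal sketch of that quantization step is below. The bin edges and token names are illustrative placeholders, not the exact vocabulary we trained on:

```python
# Illustrative telemetry-to-token quantizer: bins continuous deltas
# (novice minus pro) into a small set of discrete semantic tokens.
# Bin edges and token names are examples, not our production vocabulary.

def _bin(value: float, edges: list[float], labels: list[str]) -> str:
    """Map a continuous value into a labelled bin (len(labels) == len(edges) + 1)."""
    for edge, label in zip(edges, labels):
        if value < edge:
            return label
    return labels[-1]

def quantize_state(dv: float, dlat: float, dlong: float, sector_id: int) -> str:
    """Convert one window of delta-telemetry into a compact token string."""
    speed = _bin(dv, [-5.0, -1.0, 1.0, 5.0],
                 ["V_MUCH_SLOWER", "V_SLOWER", "V_MATCH", "V_FASTER", "V_MUCH_FASTER"])
    lat = _bin(dlat, [-0.3, 0.3], ["LAT_UNDER", "LAT_MATCH", "LAT_OVER"])
    long = _bin(dlong, [-0.3, 0.3], ["LONG_UNDER", "LONG_MATCH", "LONG_OVER"])
    return f"<SEC_{sector_id}> <{speed}> <{lat}> <{long}>"

print(quantize_state(dv=-7.2, dlat=-0.45, dlong=0.6, sector_id=4))
# -> <SEC_4> <V_MUCH_SLOWER> <LAT_UNDER> <LONG_OVER>
```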
We synthesized 20,000 instruction-response pairs by programmatically mapping these deltas to expert coaching commands. This synthetic data generation allowed us to align the model's latent space with the deterministic physics of the track.
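A sketch of how such pairs can be generated programmatically. The thresholds, coaching phrases, and rule table are placeholders, and for brevity the instruction carries raw deltas rather than the quantized tokens from the step above:

```python
import json
import random

# Hypothetical rule table: (condition on the delta window, coaching action, explanation).
# Thresholds and phrases are placeholders, not the production coaching rules.
COACHING_RULES = [
    (lambda d: d["dv"] < -5 and d["dlong"] > 0.3,
     "BRAKE_LATER", "You are braking too early; carry more speed into the zone."),
    (lambda d: d["dlat"] < -0.3,
     "MORE_COMMITMENT", "You are under-driving the corner; build lateral load toward the apex."),
    (lambda d: d["dv"] > 2,
     "LIFT", "You are over-speed for this sector; lift before turn-in."),
]

def make_pair(delta: dict, sector_id: int) -> dict:
    """Turn one novice-vs-pro delta window into an instruction-response pair."""
    action, why = "HOLD_LINE", "Inputs match the reference lap; maintain your line."
    for rule, act, explanation in COACHING_RULES:
        if rule(delta):
            action, why = act, explanation
            break
    return {
        "instruction": f"<SEC_{sector_id}> dv={delta['dv']:+.1f} "
                       f"dlat={delta['dlat']:+.2f} dlong={delta['dlong']:+.2f}",
        "response": f"<action>{action}</action> {why}",
    }

# Generate a small synthetic batch (random deltas stand in for real telemetry windows).
random.seed(0)
pairs = [make_pair({"dv": random.uniform(-10, 5),
                    "dlat": random.uniform(-0.8, 0.8),
                    "dlong": random.uniform(-0.8, 0.8)}, sector_id=s % 12)
         for s in range(5)]
print(json.dumps(pairs[0], indent=2))
```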
2. The Fine-Tuning Recipe: QLoRA for the Domain
We chose Gemma because its architecture fits comfortably within the VRAM constraints of consumer hardware (and eventually, edge devices) while offering strong reasoning capabilities.
Training Configuration:
- Base Model: Gemma (Instruction Tuned)
- Method: QLoRA (Quantized Low-Rank Adaptation)
- Rank (r): 16 (to capture domain-specific features without overfitting)
- Target Modules: `q_proj, k_proj, v_proj, o_proj`
- Precision: `bf16` (Brain Float 16) for stability.
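One way to express this recipe with the Hugging Face stack is sketched below. The rank and target modules come from the table above; the base model id, alpha, and dropout are assumptions:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_id = "google/gemma-2-2b-it"  # assumed instruction-tuned base; swap in your Gemma variant

# 4-bit base weights (the "Q" in QLoRA) with bf16 compute for stability.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)

# Low-rank adapters on the attention projections, r=16 as in the recipe above.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,          # assumption: 2x rank
    lora_dropout=0.05,      # assumption
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```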
The Model:
You can view and test our fine-tuned model on HuggingFace here: https://huggingface.co/rabimba/gemma2racer
The "Safety Constraint" Loss Function
Standard fine-tuning optimizes for token probability. For a racing coach, we need to optimize for safety. During training, we focused on penalizing "hallucinated actions." If the telemetry indicated a braking zone (high negative longitudinal G requirement), the model had to learn that "ACCELERATE" was not just a wrong token, but a dangerous one.
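A minimal sketch of how such a penalty can be expressed as a token-weighted cross-entropy, where safety-critical positions cost more than ordinary mistakes. The mask construction and penalty value here are illustrative, not the exact objective we trained with:

```python
import torch
import torch.nn.functional as F

def safety_weighted_loss(logits: torch.Tensor,
                         labels: torch.Tensor,
                         danger_mask: torch.Tensor,
                         penalty: float = 5.0) -> torch.Tensor:
    """
    Cross-entropy over the sequence, but positions flagged as safety-critical
    are up-weighted, so emitting a dangerous action there costs far more than
    an ordinary wrong token.

    logits:      (batch, seq, vocab) model outputs
    labels:      (batch, seq) target token ids, -100 where ignored
    danger_mask: (batch, seq) True at the action token when the telemetry
                 window indicates a braking zone (built during data prep)
    """
    vocab = logits.size(-1)
    per_token = F.cross_entropy(
        logits.view(-1, vocab),
        labels.view(-1),
        reduction="none",
        ignore_index=-100,
    ).view(labels.shape)

    weights = torch.ones_like(per_token)
    weights[danger_mask] = penalty  # e.g. "ACCELERATE" predicted in a braking zone

    valid = labels != -100
    return (per_token * weights)[valid].mean()
```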
3. The Future Goal: Deployment on the Edge
While our fine-tuned model is showing promising results in evaluation, the "final mile"—running this inside the car without an internet connection—is our next major milestone.
Training is easy; inference at 150mph is hard. Our roadmap for the live deployment involves pushing this model to the browser using MediaPipe and WebGPU.
The Targeted Architecture:
To achieve the safety requirements of motorsport, we are architecting a "vibe-coded" link between sensors and the model that we call the Antigravity Pipeline:
- Quantization to Int4: We plan to compress the fine-tuned adapter and merge it with the base model, converting weights to 4-bit integers. We estimate this will reduce the model size drastically, allowing for instant loading in Chrome on an Android automotive unit.
- The Inference Loop:
  - Ingest: 10Hz GPS/OBD-II data via the Web Serial API.
  - Tokenize: a JavaScript function converts raw floats into our custom `<tokens>`.
  - Inference: `llmInference.generateResponse(tokens)` running directly on the device GPU.
  - Guardrails: a deterministic regex layer parses the output and drops any packet that doesn't match the `<action>...</action>` format (sketched below).
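The production guardrail will live in JavaScript alongside the inference call, but the logic is simple enough to sketch in Python. The action whitelist is illustrative:

```python
import re

# Only these cue verbs are ever forwarded to the driver (illustrative whitelist).
ALLOWED_ACTIONS = {"BRAKE", "ACCELERATE", "LIFT", "TURN_IN", "HOLD_LINE"}
ACTION_RE = re.compile(r"<action>\s*([A-Z_]+)\s*</action>")

def extract_safe_action(model_output: str) -> str | None:
    """Return a whitelisted action, or None so the packet is dropped."""
    match = ACTION_RE.search(model_output)
    if match is None:
        return None                      # malformed output: drop, never improvise
    action = match.group(1)
    return action if action in ALLOWED_ACTIONS else None

print(extract_safe_action("<action>BRAKE</action> trail off at the 100m board"))  # BRAKE
print(extract_safe_action("Sure! You should probably brake soon."))               # None
```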
The Latency Target:
- Current cloud inference (Vertex AI) hovers between 450ms and 800ms, which is unusable for braking zones.
- Our target for the edge deployment (Gemma Int4) is ~38ms per token.
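As a back-of-the-envelope check at 150mph (about 220 ft/s), comparing a worst-case cloud round-trip against a single on-device token step:

$$d = v \cdot t:\qquad 220~\text{ft/s} \times 0.8~\text{s} \approx 176~\text{ft} \quad \text{vs.} \quad 220~\text{ft/s} \times 0.038~\text{s} \approx 8~\text{ft}$$

The car covers most of a braking zone waiting on the cloud, versus less than a car length on-device.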
This speed would allow the "Reflex" agent to give audio cues ("Brake!", "Turn in!") instantly, effectively creating a co-pilot that thinks as fast as you drive.
4. Why This Matters: Trustable AI
The "Trustable" part is the foundation of our architecture.
- Deterministic Fallback: If the model’s confidence score drops, the system defaults to hard-coded safety rules.
- Visual Grounding: We don't just trust the text. Our validation UI overlays the model's "thought process" on the video feed, allowing human race engineers to verify the AI's logic.
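One way the deterministic fallback can be wired, using average token probability as a confidence proxy, is sketched below. The threshold and per-sector rules are illustrative assumptions:

```python
# Illustrative fallback: if the model's average token probability for its cue
# falls below a threshold, discard it and use a hard-coded rule for the sector.

HARD_CODED_CUES = {4: "BRAKE", 5: "TURN_IN"}   # assumed per-sector safety rules

def choose_cue(model_action: str | None, mean_token_prob: float,
               sector_id: int, threshold: float = 0.85) -> str:
    """Prefer the model's cue only when it parsed cleanly and is confident."""
    if model_action is not None and mean_token_prob >= threshold:
        return model_action
    return HARD_CODED_CUES.get(sector_id, "HOLD_LINE")

print(choose_cue("ACCELERATE", mean_token_prob=0.42, sector_id=4))  # BRAKE (fallback)
```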
By fine-tuning Gemma for this specific, high-velocity domain, we are proving that Small Language Models (SLMs) are the key to bringing AI into the physical, real-time world.