We are currently living through the “Post-Reasoning” phase of the AI hype cycle.
By now, models like Gemini 2.0 and its successors have normalized the idea that machines can “think”, or at least simulate a chain of thought that feels indistinguishable from reasoning.
But as we push these architectures to their absolute limits, we are starting to see a plateau. It isn’t a plateau of competence; the models are brilliant. It is a plateau of certainty.
In building applications on top of these models, I’ve noticed a recurring pattern. Developers (myself included) often assume that if a model fails to predict the right outcome, it’s a failure of intelligence. We assume we need a larger parameter count, a longer context window, or better fine-tuning.
But there is a ghost in the machine that scaling laws cannot exorcise. It is the fundamental difference between not knowing and not seeing.
The Architecture of Doubt
To understand why our models, even state-of-the-art ones, hit a wall, we have to look at what they are actually doing. Despite the “Reasoning” labels on the box, modern LLMs are fundamentally probabilistic engines. They estimate a conditional probability distribution:
P(Y | X)
Given a context X (your prompt, a code snippet, a video), what is the most likely target Y?
In the early 2020s, we spent all our energy optimizing the function that maps X to Y. We assumed that if we just made the neural network dense enough, the error rate would drop to zero. But this ignores the statistical reality that error comes from two distinct places:
- Epistemic Uncertainty: The model doesn’t know the answer because it hasn’t seen enough training data or lacks the computational depth to find the pattern. This is solvable. This is what “scaling up” solves.
- Aleatoric Uncertainty: The answer cannot be derived from the input. The data X simply does not contain the information required to resolve Y.
This second category is the silent killer of AI reliability.
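To make the split concrete, here is a minimal sketch (plain NumPy, with invented ensemble outputs) of one standard trick for separating the two: average an ensemble’s predictions, treat the mean per-member entropy as the aleatoric part, and treat the leftover disagreement between members as the epistemic part.

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Shannon entropy of a probability vector (in nats)."""
    return -np.sum(p * np.log(p + eps), axis=-1)

def decompose_uncertainty(member_probs):
    """
    member_probs: shape (n_members, n_classes), the predicted class
    probabilities from each member of an ensemble for ONE input X.

    Returns (total, aleatoric, epistemic):
      total     = entropy of the averaged prediction
      aleatoric = average entropy of each member's own prediction
      epistemic = total - aleatoric (the members' disagreement)
    """
    member_probs = np.asarray(member_probs)
    mean_prediction = member_probs.mean(axis=0)
    total = entropy(mean_prediction)
    aleatoric = entropy(member_probs).mean()
    epistemic = total - aleatoric
    return total, aleatoric, epistemic

# Three members that all agree the outcome is a coin flip:
# high aleatoric uncertainty, near-zero epistemic uncertainty.
print(decompose_uncertainty([[0.5, 0.5], [0.48, 0.52], [0.52, 0.48]]))

# Three members that confidently disagree with each other:
# low aleatoric, high epistemic -- more data or capacity could help here.
print(decompose_uncertainty([[0.95, 0.05], [0.05, 0.95], [0.9, 0.1]]))
```

Scaling attacks the second number. Nothing in the scaling playbook moves the first.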
The Oracle’s Blindfold
Consider a thought experiment. You give a multimodal model like Gemini Pro a high-resolution image of a poker table and ask it to predict who will win the hand.
The model can identify the cards on the table. It can scan the players’ faces for micro-expressions that betray a bluff. It can calculate the pot odds with superhuman precision. It might give you a probability:
“Player A has a 60% chance of winning.”
But if the outcome depends on the hidden cards in the deck, the model hits a hard ceiling. No amount of extra compute, no amount of “System 2 thinking,” and no amount of historical training data will improve that prediction. The information it needs simply never reaches its inputs.
We call this the Bayes Error Rate: the lowest possible error rate any classifier can achieve given the information available in the input.
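A toy simulation makes that ceiling tangible. Assume, purely for illustration, that the outcome Y depends on the visible input X and on a hidden variable Z the model never receives; even a perfect model of P(Y | X) then tops out at a fixed accuracy, while expanding the input to include Z removes the ceiling entirely.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# X: what the model can see (the cards on the table).
# Z: what it cannot see (the cards still hidden in the deck).
X = rng.integers(0, 2, size=n)
Z = (rng.random(n) < 0.15).astype(int)

# The outcome is fully determined by X *and* Z together.
Y = X ^ Z

# The best any model can do from X alone is predict Y = X.
# Its error rate is exactly P(Z = 1) = 0.15: the Bayes error for this input.
acc_from_X = (X == Y).mean()
print(f"optimal accuracy from X alone: {acc_from_X:.3f}")   # ~0.85

# Expand the observable input to include Z and Y becomes deterministic.
acc_from_XZ = ((X ^ Z) == Y).mean()
print(f"optimal accuracy from X and Z: {acc_from_XZ:.3f}")  # 1.000
```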
In 2025, we are guilty of conflating confidence with calibration. We teach our models to sound sure of themselves. If a model predicts a stock movement or a medical diagnosis, it often mimics the assertive tone of the human experts in its training data. But unless the model has access to the causal variables driving the outcome, that confidence is a hallucination of competence.
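Calibration, unlike confidence, is measurable. The sketch below uses the standard expected-calibration-error recipe on invented data: a model that always sounds 95% sure but is right only 60% of the time scores badly, no matter how assertive its tone.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """
    confidences: the model's stated probability for its chosen answer.
    correct:     1 if that answer was actually right, else 0.
    ECE measures the gap between how sure the model *sounds* and how
    often it is actually right, averaged over confidence bins.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap
    return ece

# A model that always says "95% sure" but is right only 60% of the time
# is confident, not calibrated: its ECE comes out around 0.35.
rng = np.random.default_rng(0)
conf = np.full(1_000, 0.95)
hits = (rng.random(1_000) < 0.60).astype(int)
print(f"ECE: {expected_calibration_error(conf, hits):.2f}")
```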
The Multimodal Trap
The push toward native multimodality was, perhaps unintentionally, the first step toward addressing the “Aleatoric” problem. By allowing a model to ingest video and audio simultaneously with text, we aren’t just giving it more data; we are giving it better data. We are expanding the dimensions of X.
However, we are still treating the input as a fixed variable. In the current paradigm, we feed the model a dataset and ask, “How well can you predict?”
The next leap in AI won’t come from asking the model to predict better. It will come from the model asking for better inputs.
From Prediction to Measurement
If we want to break the current ceiling of predictability, we have to stop treating AI as a brain in a jar and start treating it as part of a sensory system.
In healthcare, for example, we are obsessed with feeding electronic health records into LLMs to predict readmission rates. We might get an AUC of 0.75 and wonder why it won’t go higher. We blame the model architecture.
The reality? The outcome might depend on whether the patient has a supportive spouse at home—a variable that does not exist in the electronic health record. The ceiling is 0.75 because the signal isn’t there.
True intelligence involves recognizing this deficit. A truly intelligent agent shouldn’t just output a probability; it should output a request for measurement. It should say:
“I cannot predict Y with confidence based on X. To reduce uncertainty, I need to measure Z.”
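What might that look like in code? Here is a deliberately toy decision rule, not any real agent framework; the names and thresholds are invented, but the shape is the point: commit when total uncertainty is low, keep reasoning when the doubt is mostly epistemic, and ask for a new measurement when it is mostly aleatoric.

```python
from dataclasses import dataclass

@dataclass
class Decision:
    kind: str      # "prediction", "deliberate", or "measurement_request"
    payload: str

def act(prediction: str,
        aleatoric: float,
        epistemic: float,
        missing_variable: str,
        threshold: float = 0.3) -> Decision:
    """Toy policy for an agent that knows not just *how* uncertain it is,
    but *why*.

    - Low total uncertainty      -> commit to the prediction.
    - Epistemic-dominated doubt  -> the model is the bottleneck; spend
                                    more compute or retrieve more data.
    - Aleatoric-dominated doubt  -> the input is the bottleneck; ask for
                                    a new measurement instead of guessing.
    """
    if aleatoric + epistemic < threshold:
        return Decision("prediction", prediction)
    if epistemic > aleatoric:
        return Decision("deliberate", "Re-examine evidence / extend reasoning before answering.")
    return Decision(
        "measurement_request",
        f"Cannot resolve the outcome from the current input; please measure {missing_variable}.",
    )

# The poker hand: the doubt is almost entirely aleatoric, so the agent
# asks to see the deck rather than dressing up a guess as a probability.
print(act("Player A wins", aleatoric=0.65, epistemic=0.05,
          missing_variable="the remaining cards in the deck"))
```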
The Future is Active Sensing
As we look toward 2026, the most exciting developments won’t be in the transformer architecture itself. They will be in the integration of these models with active sensing.
- Coding: Instead of just predicting the bug, the IDE inserts a logging statement to capture the missing runtime variable (a toy version is sketched below).
- Science: Instead of predicting the protein fold, the system suggests the specific wet-lab assay needed to resolve the ambiguity.
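Here is a toy sketch of the coding case; it is not a real IDE feature, and the function and snippet are invented. Rather than predicting what a suspect variable holds at runtime, the tool patches the program so the next run measures it.

```python
def instrument_for_missing_signal(source: str, variable: str, line_no: int) -> str:
    """
    Toy version of the "IDE inserts a logging statement" idea: instead of
    guessing what `variable` holds at runtime, patch the code so the next
    run reports it.
    """
    lines = source.splitlines()
    indent = len(lines[line_no]) - len(lines[line_no].lstrip())
    probe = " " * indent + f'print(f"[active-sensing] {variable}={{{variable}!r}}")'
    return "\n".join(lines[:line_no] + [probe] + lines[line_no:])

buggy = """def total(cart):
    subtotal = sum(item.price for item in cart)
    discount = lookup_discount(cart)
    return subtotal - discount
"""

# The model suspects `discount` but cannot infer its runtime value,
# so it asks the program itself to report it on the next run.
print(instrument_for_missing_signal(buggy, "discount", line_no=3))
```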
We have spent the last decade building better prediction engines—machines that have developed a deeply intuitive “gestalt” about the world. They are incredible at intuition. But intuition without observation is just guessing.
To lift the ceiling of what is predictable, we don’t need bigger models. We need to expand the observable universe of the data itself.
We need to stop trying to force our models to be oracles, and start designing them to be scientists.