Before we start, I want to be transparent and acknowledge the strong counter-argument to the thesis below: the idea that “scale is all you need”, also known as the “Scaling Hypothesis”.
Many researchers argue that we don’t need explicit world model architectures, believing instead that an understanding of physics and causality will emerge implicitly if we simply keep scaling Transformers with enough multimodal data. They point to recent reasoning models that appear to simulate future states purely through advanced pattern matching, suggesting that with enough compute, the boundary between predicting tokens and understanding the world might eventually dissolve.
It is a valid perspective, but the article below explains why I remain skeptical that scaling alone will be enough to bridge the gap between describing the world and truly understanding its dynamics.
Alright, let’s go 🚀!
Over the last few years, I’ve spent a lot of time observing how large language models are being pushed into more and more roles: reasoning engines, agents, copilots, even autonomous systems. The progress has been impressive… sometimes genuinely surprising.
But the more I look at how these systems behave in practice, the more convinced I become that we’re approaching a ceiling. Not because LLMs are failing, but because they are being asked to do something they were never designed for.
From my perspective, world models are the next inevitable step in AI research after LLMs.
What I’m Seeing in Today’s Systems
LLMs are extraordinarily good at working within the space they were trained on. They recognize patterns, generalize across domains, and approximate reasoning remarkably well. Yet when I watch them interact with real environments (whether physical systems, simulations, or long-running processes), the same limitations show up again and again.
They are reactive.
They respond to prompts, generate plans, and call tools, but they do not anticipate outcomes in a meaningful way. When something goes wrong, the usual fix is to add more scaffolding: longer prompts, memory layers, planning loops, or external evaluators.
To me, this feels like compensating for a missing internal capability rather than extending an existing one.
The Missing Piece: An Internal Model of the World
What I believe is missing is an internal representation of how the world evolves over time.
A world model gives a system the ability to internally simulate: “If I take this action, what is likely to happen next? And what happens after that?”
This is not about hard-coded rules or symbolic simulators. It’s about learned, latent models of dynamics: models that capture causality, temporal structure, and uncertainty.
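To make the idea concrete, here is a minimal sketch of what such a learned latent dynamics model could look like. The names (`LatentWorldModel`, `encode`, `step`) and the architecture are hypothetical illustrations of the concept, not any specific published system: an encoder maps an observation to a latent state, and a transition network predicts how that state evolves under an action, together with an estimated reward.

```python
# A minimal sketch of a learned latent world model (hypothetical names and
# architecture, for illustration only). The key idea: once trained, the model
# can "imagine" the consequences of actions entirely in latent space.
import torch
import torch.nn as nn

class LatentWorldModel(nn.Module):
    def __init__(self, obs_dim: int, action_dim: int, latent_dim: int = 32):
        super().__init__()
        # Observation -> latent state
        self.encoder = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, latent_dim)
        )
        # (latent state, action) -> next latent state
        self.transition = nn.Sequential(
            nn.Linear(latent_dim + action_dim, 128), nn.ReLU(),
            nn.Linear(128, latent_dim),
        )
        # (latent state, action) -> predicted reward
        self.reward_head = nn.Sequential(
            nn.Linear(latent_dim + action_dim, 64), nn.ReLU(), nn.Linear(64, 1)
        )

    def encode(self, obs: torch.Tensor) -> torch.Tensor:
        return self.encoder(obs)

    def step(self, z: torch.Tensor, action: torch.Tensor):
        # One imagined step: no real environment is touched.
        za = torch.cat([z, action], dim=-1)
        return self.transition(za), self.reward_head(za)
```

In a real system the model would be trained on logged interaction data and would also represent uncertainty, but even this toy version shows the shift: prediction happens over states and dynamics, not over tokens.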
When I look at the most robust systems in robotics, reinforcement learning, and autonomous decision-making, a clear pattern emerges: the strongest systems reason inside a model of the world before they act in it.
Why Scaling Language Alone Feels Insufficient
I’m not convinced that simply scaling LLMs further will solve this problem.
Language data encodes descriptions of the world, but not the world’s mechanics. You can learn how people talk about physics without learning physics itself. You can learn how plans are written without understanding which ones are feasible.
This shows up most clearly in:
- long-horizon planning
- physical interaction
- and situations that require causal reasoning under uncertainty
At some point, more parameters just mean more fluent guesses. What’s missing is grounding!
Signals from Research I Pay Attention To
What reinforces my conviction is where serious research efforts are going.
DeepMind’s work on MuZero, Dreamer, and more recent generative environment models follows a consistent logic: learn the world first, then plan within it. In robotics, model-based approaches keep resurfacing because purely reactive policies struggle with safety, sample efficiency, and generalization.
Even discussions around embodied AI and autonomous agents are slowly converging on the same insight: without a world model, intelligence remains shallow and brittle.
This doesn’t feel like a trend. It feels like convergence.
From Agents That React to Systems That Anticipate
When I look at current “agentic” systems, I see impressive demos… but also fragility.
An agent without a world model can execute steps, but it cannot reliably reason about long-term consequences. It can optimize locally, but it struggles with global trade-offs. It acts first and corrects later.
World models change this dynamic entirely. They enable systems to deliberate, to simulate futures, and to choose actions based on anticipated outcomes rather than immediate feedback.
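Here is a toy illustration of that “simulate first, act second” loop, building on the `LatentWorldModel` sketched earlier. It is a simple random-shooting planner I am using as an assumed example, not the actual planning machinery of MuZero or Dreamer (which rely on tree search and learned policies respectively): sample candidate action sequences, roll each one out inside the model, and act on the one with the best predicted outcome.

```python
# Toy planning-by-simulation loop (random shooting), assuming the hypothetical
# LatentWorldModel above. Futures are evaluated in imagination before any real
# action is taken.
import torch

@torch.no_grad()
def plan_first_action(model, obs: torch.Tensor, action_dim: int,
                      horizon: int = 10, n_candidates: int = 256) -> torch.Tensor:
    # Encode the current observation once, then branch into many imagined futures.
    z = model.encode(obs.unsqueeze(0)).repeat(n_candidates, 1)
    actions = torch.randn(n_candidates, horizon, action_dim)  # candidate action sequences
    returns = torch.zeros(n_candidates)
    for t in range(horizon):
        z, reward = model.step(z, actions[:, t])
        returns += reward.squeeze(-1)
    best = returns.argmax()
    return actions[best, 0]  # execute only the first action, then replan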
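Executing only the first action and then replanning (a receding-horizon loop) is what turns imagined rollouts into anticipation rather than blind commitment to a plan.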
To me, that shift from reaction to anticipation is a defining characteristic of the next generation of AI systems.
Why This Matters Beyond Theory
I don’t see world models as an abstract research direction. I see them as a prerequisite for deploying AI into environments that actually matter.
As soon as AI systems:
- control physical devices (robots, UAVs, and so on)
- interact with economic systems
- or operate autonomously over long time horizons
they need the ability to reason about consequences before acting. Trial-and-error in the real world doesn’t scale.
World models make that possible.
My Takeaway
LLMs changed how machines represent knowledge.
World models will change how machines understand change!
From what I observe, the future of AI research is less about making models larger and more about making them situated: embedded in a learned understanding of how the world behaves.
That’s why I believe world models are not just the next step after LLMs.
They are the foundation for AI systems that can truly operate in the real world.
