By moving away from generative models toward joint embedding predictive architectures, AI pioneer Yann LeCun is betting that the path to true intelligence lies in world models, not just next-token prediction.
Beyond the Generative Horizon
In the current AI landscape, Large Language Models (LLMs) reign supreme. These systems operate on a simple yet powerful principle: next-token prediction. By training on trillions of words, they learn to predict the most likely next piece of text, a process that has yielded surprisingly sophisticated results in reasoning and creative writing. However, AI legend Yann LeCun argues that this generative path is reaching a plateau. While LLMs are masters of language, they are essentially untethered from the physical world. They lack common sense, the ability to plan, and a fundamental understanding of cause and effect.
LeCun’s alternative is not a single model but a framework called JEPA: Joint Embedding Predictive Architecture. Unlike GPT-4 or Sora, JEPA is not designed to spit out text, images, or video. It is designed to learn. By moving away from the generative requirement—the need to reconstruct every pixel or word—LeCun believes we can build systems that understand the world the way animals and humans do. This is a billion-dollar bet that the future of AI lies in world models rather than increasingly massive autocomplete engines.
The Problem with Pixels
To understand why LeCun is moving away from generative architectures, one must look at the failure of these models in the realm of video. In language, the vocabulary is discrete and finite; a model only has to choose from about 50,000 possible tokens. In video, the space of outcomes is finite in principle but effectively unbounded. A single frame of high-definition video contains millions of pixels, each able to take any of millions of colors. Predicting the next frame of a video is mathematically daunting because the number of possible futures is larger than the number of atoms in the observable universe.
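The size of that gap can be checked with a few lines of arithmetic. The sketch below is only a back-of-the-envelope estimate: the 1920x1080 resolution, the three 8-bit color channels, and the rough figure of 10^80 atoms are assumptions chosen for illustration, while the 50,000-token vocabulary is the number quoted above.

```python
import math

# Assumed illustration values (not from the article): a 1920x1080 frame,
# 3 colour channels, 256 intensity levels per channel.
WIDTH, HEIGHT, CHANNELS, LEVELS = 1920, 1080, 3, 256
VOCAB_SIZE = 50_000   # token count quoted in the text
LOG10_ATOMS = 80      # commonly cited rough estimate: about 10^80 atoms


def log10_token_choices(vocab_size: int) -> float:
    """log10 of the number of possible next tokens."""
    return math.log10(vocab_size)


def log10_frame_choices(width: int, height: int, channels: int, levels: int) -> float:
    """log10 of the number of distinct frames, i.e. levels ** (w * h * c)."""
    return width * height * channels * math.log10(levels)


token_digits = log10_token_choices(VOCAB_SIZE)
frame_digits = log10_frame_choices(WIDTH, HEIGHT, CHANNELS, LEVELS)

print(f"next token: about 10^{token_digits:.1f} options")
print(f"next frame: about 10^{frame_digits:,.0f} options")
print(f"atoms in the observable universe: about 10^{LOG10_ATOMS}")
```

Under these assumptions the exponent for a single frame runs to roughly fifteen million, against an exponent of about five for a token and about eighty for the atom count.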
When generative models attempt to predict the next frame of a video, they often produce a 'blurry' mess. This happens because the model encounters uncertainty. If a ball could bounce left or right, a generative model forced to predict a single set of pixels will often average the two outcomes, resulting in a washed-out, ghostly image. This blurriness compounds over time, making long-term prediction impossible. LeCun realized that for an AI to understand a scene, it doesn't need to predict every flickering leaf on a tree; it needs to understand the underlying structure of the environment.
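The averaging effect is easy to reproduce in miniature. The following toy example is a sketch, not a real video model: it assumes a made-up five-pixel "frame" in which a ball lands either left or right with equal probability, and it scores single-frame predictions by mean squared error, the kind of pixel-wise loss that rewards hedging.

```python
def mse(pred, target):
    """Mean squared error between two equally sized pixel lists."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def expected_mse(pred, outcomes):
    """Average MSE of one prediction over equally likely outcomes."""
    return sum(mse(pred, o) for o in outcomes) / len(outcomes)


# Two equally likely futures for a made-up 5-pixel frame.
ball_left = [0.0, 1.0, 0.0, 0.0, 0.0]
ball_right = [0.0, 0.0, 0.0, 1.0, 0.0]
futures = [ball_left, ball_right]

# The pixel-wise average of the futures: a half-bright ball in both places.
blurry = [sum(pixel) / len(futures) for pixel in zip(*futures)]

print("blurry prediction:", blurry)
print("expected MSE, blurry:", expected_mse(blurry, futures))
print("expected MSE, sharp :", expected_mse(ball_left, futures))
```

The ghostly prediction with two half-bright balls scores better on average than committing to either sharp outcome, which is why a model trained this way drifts toward blur.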
The Epiphany of Joint Embeddings
The solution lies in 'joint embedding' architectures, a concept LeCun has explored since his days at Bell Labs in the 1990s. Instead of predicting raw data (Y) from input (X), both the input and the target are passed through encoders that turn them into abstract mathematical vectors called embeddings. The system then tries to predict the embedding of the future state from the embedding of the current state. This allows the model to ignore irrelevant details—like the exact position of every pixel in a cloud—and focus on the semantic meaning of the scene.
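A minimal sketch of the idea follows. It is a deliberately hand-built toy: in a real joint embedding system the encoder and predictor are learned networks, whereas here `encode` and `predict` are hypothetical functions written by hand so the effect is visible. Each observation is assumed to hold a ball position, a velocity, and two unpredictable "texture" values standing in for the cloud pixels.

```python
import random


def encode(observation):
    """Toy encoder: keep position and velocity, drop the texture values.

    Hand-written for illustration; a real encoder would be learned.
    """
    x, v, _texture_1, _texture_2 = observation
    return [x, v]


def predict(embedding):
    """Toy predictor in embedding space: advance position by velocity."""
    x, v = embedding
    return [x + v, v]


def squared_distance(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))


rng = random.Random(0)
current = [0.0, 1.0, rng.random(), rng.random()]
future = [1.0, 1.0, rng.random(), rng.random()]  # texture changes at random

# Loss measured between the predicted and the actual future embedding.
embedding_loss = squared_distance(predict(encode(current)), encode(future))

# A raw-data predictor must also guess the texture; reusing the current
# texture is the best this toy guess can do, and it is still penalised.
raw_guess = [current[0] + current[1], current[1], current[2], current[3]]
raw_loss = squared_distance(raw_guess, future)

print("embedding-space loss:", embedding_loss)
print("raw-space loss      :", raw_loss)
```

Because the embedding discards what cannot be predicted, the predictor is judged only on the part of the scene that has structure.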
For years, this approach was plagued by 'representation collapse.' Because the model is rewarded for making the embeddings of two related images similar, it often finds a 'cheat code': it outputs the same constant vector for every single input. If every image results in a vector of all ones, the similarity is technically maximized, but the model has learned nothing. This hurdle kept non-generative models in the shadow of their generative counterparts until a breakthrough inspired by 1960s neuroscience provided a way forward.
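The cheat is simple enough to write down. In this sketch the "images", their perturbed views, and the three-dimensional embedding are all made up for illustration; the point is only that a constant encoder drives a pure similarity loss to zero while carrying no information.

```python
from statistics import pvariance


def collapsed_encoder(image):
    """A degenerate encoder: ignores its input and returns a constant vector."""
    return [1.0, 1.0, 1.0]


def similarity_loss(emb_a, emb_b):
    """Squared distance between two embeddings (zero means 'identical')."""
    return sum((a - b) ** 2 for a, b in zip(emb_a, emb_b))


# Made-up 'images' and a slightly perturbed view of each one.
images = [[0.1, 0.9], [0.8, 0.2], [0.5, 0.5]]
views = [[0.12, 0.88], [0.79, 0.23], [0.51, 0.48]]

total_loss = sum(
    similarity_loss(collapsed_encoder(x), collapsed_encoder(v))
    for x, v in zip(images, views)
)

# Variance of each embedding dimension across the batch: zero everywhere
# means the embeddings cannot tell any two images apart.
embeddings = [collapsed_encoder(x) for x in images]
per_dim_variance = [pvariance(dim) for dim in zip(*embeddings)]

print("similarity loss:", total_loss)
print("variance per embedding dimension:", per_dim_variance)
```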
The Barlow Twins Breakthrough
The breakthrough came from the Barlow Twins approach, named after neuroscientist Horace Barlow. Barlow hypothesized that biological sensory neurons work by reducing redundancy in their inputs. Applying this to AI, LeCun’s team developed a loss function that forces the model’s output neurons to be as informative and non-redundant as possible: two distorted views of the same image are encoded, and the cross-correlation matrix between the two sets of embeddings is pushed toward the identity matrix. By demanding that different neurons capture different features of the data, they effectively blocked the path to representation collapse. This allowed the model to learn deep, rich internal representations of images without ever being told what those images contained.
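A redundancy-reduction loss of this kind can be sketched in plain Python. The version below follows the commonly described form of the objective: standardize each embedding dimension across the batch, compute the cross-correlation matrix between the embeddings of two views, then penalize diagonal entries that differ from 1 and off-diagonal entries that differ from 0. The function names, the tiny hand-made batches, and the weight `lam` are illustrative assumptions, not the team's actual code.

```python
import math


def standardize(column, eps=1e-8):
    """Zero-mean, unit-variance version of one embedding dimension."""
    mean = sum(column) / len(column)
    std = math.sqrt(sum((v - mean) ** 2 for v in column) / len(column))
    return [(v - mean) / (std + eps) for v in column]


def redundancy_reduction_loss(z_a, z_b, lam=5e-3):
    """Push the cross-correlation matrix of two views toward the identity.

    z_a, z_b: lists of embeddings (one per sample) for two views of a batch.
    lam: assumed weight on the off-diagonal (redundancy) term.
    """
    n = len(z_a)
    cols_a = [standardize(list(col)) for col in zip(*z_a)]
    cols_b = [standardize(list(col)) for col in zip(*z_b)]
    loss = 0.0
    for i, col_a in enumerate(cols_a):
        for j, col_b in enumerate(cols_b):
            c_ij = sum(a * b for a, b in zip(col_a, col_b)) / n
            if i == j:
                loss += (1.0 - c_ij) ** 2   # same neuron should agree across views
            else:
                loss += lam * c_ij ** 2     # different neurons should not correlate
    return loss


# Hand-made batches of 2-D embeddings; both views are identical for simplicity.
informative = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
redundant = [[1.0, 1.0], [1.0, 1.0], [-1.0, -1.0], [-1.0, -1.0]]
collapsed = [[1.0, 1.0]] * 4

for name, z in [("informative", informative), ("redundant", redundant), ("collapsed", collapsed)]:
    print(name, redundancy_reduction_loss(z, z))
```

In this toy setup the collapsed batch receives the largest penalty, the redundant batch a smaller one, and only the batch whose dimensions carry different information scores near zero, which is the property that closes off the constant-vector shortcut.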
The results were staggering. When these self-supervised models were tested on image classification, they began to rival and eventually surpass models trained with human-labeled data. By 2024, models like DINOv3 demonstrated that a joint embedding architecture could achieve state-of-the-art accuracy on massive datasets like ImageNet. More importantly, these models showed an uncanny ability to segment objects (distinguishing a hand from a background or a cat from a rug) entirely on their own, simply by learning the structure of visual data.
Building the World Model
The ultimate goal of this research is the creation of an autonomous world model. LeCun points out that a teenager can learn to drive a car in about 20 hours, whereas our best AI systems require millions of hours of data and still struggle with edge cases. The difference is that the teenager has a world model; they can imagine the consequences of turning the wheel too sharply without actually having to crash the car to learn. They have 'common sense,' which LeCun defines as a collection of models that tell an agent what is likely, plausible, or impossible.
In the JEPA framework, the world model is a predictor that sits between embeddings. It can be conditioned on actions, allowing the AI to ask, 'If I send this signal to the robot arm, what will the world look like in the next second?' By simulating these sequences of actions internally, the AI can plan an optimal path to a goal. This is a return to the principles of classical optimal control from the 1950s, but with a modern twist: the model of the world is not programmed by humans, but learned from the data itself.
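The planning loop can be sketched as a brute-force search over imagined action sequences. Everything here is a simplifying assumption made for illustration: the latent state is a 2-D position, the hypothetical `predict_next` function is hand-written rather than learned, and the planner simply tries every sequence of a fixed length and keeps the one whose imagined end state lands closest to the goal.

```python
from itertools import product

# Hypothetical action set: each action shifts a 2-D latent state by one unit.
ACTIONS = {"left": (-1, 0), "right": (1, 0), "up": (0, 1), "down": (0, -1)}


def predict_next(state, action):
    """Action-conditioned predictor (hand-written stand-in for a learned one)."""
    dx, dy = ACTIONS[action]
    return (state[0] + dx, state[1] + dy)


def imagine(state, plan):
    """Roll a whole action sequence forward without acting in the world."""
    for action in plan:
        state = predict_next(state, action)
    return state


def cost(state, goal):
    """Squared distance between an imagined state and the goal state."""
    return (state[0] - goal[0]) ** 2 + (state[1] - goal[1]) ** 2


def plan_actions(state, goal, horizon=3):
    """Pick the action sequence whose imagined outcome is closest to the goal."""
    candidates = product(ACTIONS, repeat=horizon)
    return min(candidates, key=lambda plan: cost(imagine(state, plan), goal))


start, goal = (0, 0), (2, 1)
best_plan = plan_actions(start, goal)
print("plan:", best_plan)
print("imagined end state:", imagine(start, best_plan))
```

Exhaustive search only works for tiny problems like this one; the design point is that every candidate is evaluated inside the model, so no action is taken until a plan has been checked against the goal.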
The Future of Agentic AI
LeCun’s critique of the current 'agentic' AI trend is sharp: you cannot build a truly reliable agent without the ability to predict consequences. Current LLMs are 'auto-regressive,' meaning they produce each token conditioned only on the tokens that came before, including their own earlier output. They do not have a workspace where they can simulate a plan and check it for safety or efficiency before acting. As LeCun puts it, they take the action first and let the 'deluge' follow. To move toward Level 5 autonomy in robotics or truly helpful digital assistants, we need systems that can reason through a search process rather than just a prediction process.
While LLMs will likely remain the masters of the 'language substrate,' the path to human-level intelligence requires a broader understanding of the physical world. The bet is that JEPA and its descendants will provide the foundation for this understanding. By focusing on the 'bulk of the cake'—self-supervised learning of world models—rather than just the 'icing' of human-labeled data, we may finally bridge the gap between machines that can talk and machines that can truly think.