I don't understand this view. The way I see it, the fundamental bottlenecks to AGI are continual learning and backpropagation. Models today are static, and human brains don't learn or adapt themselves with anything close to backpropagation. World models don't solve either of these problems; they are fundamentally the same kind of deep learning architectures we're used to working with. Heck, if you think learning from the world itself is the bottleneck, you can just put a vision-action LLM in a reinforcement learning loop in a robotic or simulated body.
reply
> I don't understand this view. The way I see it, the fundamental bottlenecks to AGI are continual learning and backpropagation. Models today are static, and human brains don't learn or adapt themselves with anything close to backpropagation.

Even with continuous backpropagation and "learning" that enriches the training data, so-called online learning, the limitations will not disappear. LLMs will not be able to conclude things about the world from fact and deduction; they only consider what is likely given their training data. They will not foresee or anticipate events that are unlikely or absent in their training data but are bound to happen due to real-world circumstances. They are not intelligent in that way.

Whether humans always apply that much effort to conclude these things is another question. The point is that humans are fundamentally capable of doing it, while LLMs structurally are not.

The problems are structural/architectural. I think it will take another 2-3 major leaps in architecture before these AI models reach human-level general intelligence, if they ever reach it. So far they can often "merely" "fake it" when things are statistically common in their training data.

reply
Humans are notoriously bad at formal logic. The Wason selection task is the classic example: most people fail a simple conditional reasoning problem unless it’s dressed up in familiar social context, like catching cheaters. That looks a lot more like pattern matching than rule application.

Kahneman’s whole framework points the same direction. Most of what people call “reasoning” is fast, associative, pattern-based. The slow, deliberate, step-by-step stuff is effortful and error-prone, and people avoid it when they can. And even when they do engage it, they’re often confabulating a logical-sounding justification for a conclusion they already reached by other means.

So maybe the honest answer is: the gap between what LLMs do and what most humans do most of the time might be smaller than people assume. The story that humans have access to some pure deductive engine and LLMs are just faking it with statistics might be flattering to humans more than it’s accurate.

Where I’d still flag a possible difference is something like adaptability. A person can learn a totally new formal system and start applying its rules, even if clumsily. Whether LLMs can genuinely do that outside their training distribution or just interpolate convincingly is still an open question. But then again, how often do humans actually reason outside their own “training distribution”? Most human insight happens within well-practiced domains.

reply
> The story that humans have access to some pure deductive engine and LLMs are just faking it with statistics might be flattering to humans more than it’s accurate.

Your point rings true with most human reasoning most of the time. Still, at least some humans do have the capability to run that deductive engine, and it seems to be a key part (though not the only part) of scientific and mathematical reasoning. Even informal experimentation and iteration rest on deductive feedback loops.

reply
> The Wason selection task is the classic example: most people fail a simple conditional reasoning problem unless it’s dressed up in familiar social context, like catching cheaters.

I'd never heard of the Wason selection task, so I looked it up, and could tell the right answer right away. But I can also tell you why: because I have some familiarity with formal logic and can, in your words, pattern-match the gotcha that "if x then y" is distinct from "if not x then not y".

Unlike you, I don't take this to mean that people are bad at logic or don't really think. It tells me that people are unfamiliar with "gotcha" formalities introduced by logicians that don't match the everyday use of language. If you added a simple clarification to the problem, such as "Note that in this context, 'if' only means that...", most people would almost certainly answer it correctly.

Mind you, I'm not arguing that human thinking is necessarily more profound than what LLMs could ever do. However, judging from the output, LLMs have a tenuous grasp on reality, so I don't think that reductionist arguments along the lines of "humans are just as dumb" are fair. There's a difference that we don't really know how to overcome.

reply
Agree with much of your comment.

Though note that, as GP said, people famously do much better on the Wason selection task when it's framed in a social context. That at least partially undermines your theory that it's a lack of familiarity with the terminology of formal logic.

reply
Your response contains a performative contradiction: you are asserting that humans are naturally logical while simultaneously committing several logical errors to defend that claim.
reply
This comment would be a lot more useful with an enumeration of those logical errors.
reply
The commenter's specific claim, that adding a note about the definition of "if" would solve the problem, is moving the goalposts and a tautology. The comment also suffers from hasty generalization (in their experience the test isn't hard) and special pleading (a double standard for LLMs and humans).
reply
When someone tells you "you can have this if you pay me", they don't mean "you can also have it if you don't pay". They are implicitly but clearly indicating you gotta pay.

It's as simple as that. In common use, "if x then y" frequently implies "if not x then not y". Pretending that it's some sort of a cognitive defect to interpret it this way is silly.

reply
> Even with continuous backpropagation and "learning"

That's what I said. Backpropagation cannot be enough; that's not how neurons work in the slightest. When you put biological neurons in a Pong environment, they don't learn to play through some kind of loss or reward function; they self-organize to avoid unpredictable stimulation. As far as I know, no architecture learns in such an unsupervised way.

https://www.sciencedirect.com/science/article/pii/S089662732...

reply
Forgive me for being ignorant, but 'loss' in a supervised-learning ML context encodes how unlikely (high loss) or likely (low loss) the network's prediction of the output was, given the input.

That sounds very similar to what those neurons are doing (avoiding unpredictable stimulation).
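
Concretely, I picture that supervised loss as something like this (a toy sketch, names purely illustrative):

    import math

    def cross_entropy(predicted_probs, true_index):
        # Negative log-probability assigned to what actually happened:
        # confident and right -> low loss; surprised -> high loss.
        return -math.log(predicted_probs[true_index])

    print(cross_entropy([0.05, 0.15, 0.80], 2))  # ~0.22: outcome was predictable to the network
    print(cross_entropy([0.70, 0.25, 0.05], 2))  # ~3.0: outcome was "unpredictable stimulation"

So the analogy I'm drawing is that the loss is basically the network's "surprise" at what actually happened.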

reply
So, I have been thinking about this for a little while. Imagine a model f that takes a world state x and makes a prediction y. At a high level, a traditional supervised model is trained like this:

f(x)=y' => loss(y',y) => how good was my prediction? Train f through backprop with that error.

A model trained with reinforcement learning is more similar to this, where m(y) is the resulting world state of taking an action y the model predicted:

f(x)=y' => m(y')=z => reward(z) => how good was the state I was in based on my actions? Train f with an algorithm like REINFORCE with the reward, as the world m is a non-differentiable black-box.

While a group of neurons is more like predicting the resulting world state of taking my action, g(x,y), and trying to learn by tuning both g and the action taken, f(x):

f(x)=y' => m(y')=z => g(x,y')=z' => loss(z,z') => how predictable were the results of my actions? Train g normally with backprop, and train f with an algorithm like REINFORCE with negative surprise as the reward.
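
Here's a rough numpy sketch of that third setup, with toy linear "networks" and made-up names (illustrative only, not a real implementation):

    import numpy as np

    rng = np.random.default_rng(0)
    Wf = rng.normal(size=(4, 2)) * 0.1        # policy f: state -> action scores
    Wg = rng.normal(size=(6, 4)) * 0.1        # forward model g: (state, action) -> predicted next state
    action_effects = rng.normal(size=(2, 4))  # hidden dynamics of the "world" m

    def f(x):                     # policy logits over 2 actions
        return x @ Wf

    def g(x, a_onehot):           # g's prediction of the next world state
        return np.concatenate([x, a_onehot]) @ Wg

    def m(x, action):             # the real world: a non-differentiable black box
        return np.tanh(x + action_effects[action])

    for step in range(200):
        x = rng.normal(size=4)                        # current world state
        logits = f(x)
        probs = np.exp(logits) / np.exp(logits).sum()
        action = rng.choice(2, p=probs)               # act
        a = np.eye(2)[action]

        z = m(x, action)                              # what actually happened
        z_hat = g(x, a)                               # what g expected to happen
        surprise = np.mean((z - z_hat) ** 2)

        # Train g with plain gradient descent on the prediction error.
        Wg -= 0.05 * np.outer(np.concatenate([x, a]), 2 * (z_hat - z) / len(z))

        # Train f REINFORCE-style with negative surprise as the reward, so the
        # policy drifts toward actions whose consequences g can predict.
        Wf += 0.05 * (-surprise) * np.outer(x, a - probs)

(Flip the sign of that last reward and you get the "curiosity" variant I mention below.)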

After talking with GPT5.2 for a little while, it seems like Curiosity-driven Exploration by Self-supervised Prediction[1] might be an architecture similar to the one I described for neurons, but with the twist that f is rewarded for making the prediction error bigger (not smaller!) as a proxy for "curiosity".

[1] https://arxiv.org/pdf/1705.05363

reply
I think people MOSTLY foresee and anticipate events that are in OUR training data, which mostly comprises information collected by our senses.

Our training data is a lot more diverse than an LLM's. We also leverage our senses as a carrier for communicating abstract ideas using audio and visual channels that may or may not be grounded in reality. We have TV shows, video games, programming languages, and all sorts of rich and interesting things we can engage with that do not reflect our fundamental reality.

Like LLMs, we can hallucinate while we sleep, or we can delude ourselves with untethered ideas, but UNLIKE LLMs, we can steer our own learning corpus. We can train ourselves with our own untethered “hallucinations”, or we can render them in art and share them with others so they can include them in their training corpus.

Our hallucinations are often just erroneous models of the world. When we render them into something that has aesthetic appeal, we might call it art.

If the hallucination helps us understand some aspect of something, we call it a conjecture or hypothesis.

We live in a rich world filled with rich training data. We don’t magically anticipate events not in our training data, but we’re not devoid of creativity (“hallucinations”) either.

Most of us are stochastic parrots most of the time. We’ve only gotten this far because there are so many of us and we’ve been on this earth for many generations.

Most of us are dazzled and instinctively driven to mimic the ideas that a small minority of people “hallucinate”.

There is no shame in mimicking or being a stochastic parrot. These are critical features that helped our ancestors survive.

reply
> We can steer our own learning corpus

This is critical. We have some degree of attentional autonomy. And we have a complex tapestry of algorithms running in thalamocortical circuits that generate “Nows”. Truncation commands produce sequences of acts (token-like products).

reply
> They will not foresee or anticipate events that are unlikely or absent in their training data but are bound to happen due to real-world circumstances. They are not intelligent in that way.

Can you be a bit more specific? Maybe give an example?

reply
I'm sure that if a car appeared from nowhere in the middle of your living room, you would not be prepared at all.

So my question is: when is there enough training data that you can handle 99.99% of the world?

reply
The main difference is that humans are learning all the time, while models learn batch-wise and forget whatever happened in a previous session unless someone makes it part of the training data, so there is a massive lag.

Whoever cracks the continuous customized (per user, for instance) learning problem without just extending the context window is going to be making a big splash. And I don't mean cheats and shortcuts, I mean actually tuning the model based on received feedback.

reply
> Models today are static, and human brains don't learn or adapt themselves with anything close to backpropagation.

While I suspect the latter is a real problem (because all mammal brains* are much more example-efficient than all ML), the former is more about productisation than a fundamental thing: the models can be continuously updated already, but that makes it hard to deal with regressions. You kinda want an artefact with a version stamp that doesn't change itself until you release the update, especially as this isn't like normal software where specific features can be toggled on or off in isolation from everything else.

* I think. Also, I'm saying "mammal" because of an absence of evidence (to my *totally amateur* skill level) not evidence of absence.

reply
They can be continuously updated, assuming you re-run representative samples of the training set through them continuously. Unlike a mammal brain, which preserves the function of neurons unless they activate in a situation that causes a training signal, deep nets suffer catastrophic forgetting because signals get scattered everywhere. If you had a model continuously learning about you in your pocket, without tons of cycles spent "remembering" old examples, it would quickly forget most of what it originally knew. In fact, this is a major stumbling block in standard training: sampling is a huge problem. If you just iterate through the training corpus, you'll have forgotten most of the English stuff by the time you finish with Chinese or Spanish. You have to constantly mix and balance training info due to this limitation.

The fundamental difference is that physical neurons have a discrete on/off activation, while digital "neurons" in a network are merely continuous differentiable operations. They also don't have a notion of "spike-timing dependency" to avoid overwriting activations that weren't related to an outcome. There are things like reward decay over time, but this applies to the signal at a very coarse level; updates are still scattered across almost the entire system with every training example.
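
To illustrate the mixing/forgetting point with a deliberately over-simplified toy (a single linear model, so strictly speaking it's interference rather than deep-net catastrophic forgetting, but it shows how unconstrained gradient updates on a new task overwrite the old one when you don't replay old examples):

    import numpy as np

    rng = np.random.default_rng(1)

    def make_task(target_w, n=200):
        X = rng.normal(size=(n, 3))
        return X, X @ target_w

    task_a = make_task(np.array([2.0, -1.0, 0.5]))   # stand-in for "English"
    task_b = make_task(np.array([-3.0, 0.0, 1.0]))   # stand-in for "Chinese"

    def sgd(w, X, y, steps=2000, lr=0.01):
        for _ in range(steps):
            i = rng.integers(len(X))
            w = w - lr * (X[i] @ w - y[i]) * X[i]    # plain SGD, no replay of other tasks
        return w

    def mse(w, X, y):
        return float(np.mean((X @ w - y) ** 2))

    w = sgd(np.zeros(3), *task_a)
    print("after A: error on A =", round(mse(w, *task_a), 3))   # near 0

    w = sgd(w, *task_b)                                         # continue training on B only
    print("after B: error on B =", round(mse(w, *task_b), 3))   # near 0
    print("         error on A =", round(mse(w, *task_a), 3))   # large again: A was overwritten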

reply
IIRC LeCun talks about a self-organizing hierarchy of real-world objects, and IMO this is exactly how the human brain actually learns
reply
The fact that models aren't continually updating seems more like a feature. I want to know the model is exactly the same as it was the last time I used it. Any new information it needs can be stored in its context window, or stored in a file to read the next time it needs to access it.
reply
> The fact that models aren't continually updating seems more like a feature.

I think this is true to some extent: we like our tools to be predictable. But we’ve already made one jump by going from deterministic programs to stochastic models. I am sure that the moment a self-evolving AI shows up that clears the "useful enough" threshold, we’ll make that jump as well.

reply
Stochasticity and unpredictability aren't exactly the same thing. I would claim current LLMs are generally predictable, even if not as predictable as a deterministic program.
reply
It’s a feature of a good tool, but a sentient intelligence is more than just a tool
reply
Persistent memory through text in the context window is a hack/workaround.

And generally:

> I want to know the model is exactly the same as it was the last time I used it.

What exactly does that gain you, when the overall behavior is still stochastic?

But still, if it's important to you, you can get the same behavior by taking a model snapshot once we crack continuous learning.

reply
Unless you use your own local models, you don't even know when OpenAI or Anthropic has tweaked the model, and by how much. One week it's version x, the next week it's version y. It's just like your operating system continuously evolving, from smaller patches to specific apps up to a whole new kernel version and a new OS release.
reply
There is still a huge gap between a model continuously updating itself and weekly patches by a specialist team. The former would make things unpredictable.
reply
You could have continual learning on text and still be stuck in the same "remixing baseline human communications" trap. It's a nasty one, very hard to avoid, possibly even structurally unavoidable.

As for the "just put a vision LLM in a robot body" suggestion: People are trying this (e.g. Physical Intelligence) and it looks like it's extraordinarily hard! The results so far suggest that bolting perception and embodiment onto a language-model core doesn't produce any kind of causal understanding. The architecture behind the integration of sensory streams, persistent object representations, and modeling time and causality is critically important... and that's where world models come in.

reply
I don't understand why online learning is that necessary. If you took Einstein at 40 and surgically removed his hippocampus so he can't learn anything he didn't already know (meaning no online learning), that's still a very useful AGI. A hippocampus is a nice upgrade to that, but not super obviously on the critical path.
reply
> If you took Einstein at 40 and surgically removed his hippocampus so he can't learn anything he didn't already know (meaning no online learning), that's still a very useful AGI.

I like how people are accepting this dubious assertion that Einstein would be "useful" if you surgically removed his hippocampus and engaging with this.

It also calls this Einstein an AGI rather than a disabled human???

reply
Hypotheticals fear him
reply
He basically said that himself:

"Reading, after a certain age, diverts the mind too much from its creative pursuits. Any man who reads too much and uses his own brain too little falls into lazy habits of thinking".

-- Albert Einstein

reply
I guess the sheer amount and variety of information you would need to pre-encode to get an Einstein at 40 is huge: an everyday stream of high-resolution video, plus the actions, consequences, thoughts, and ideas he had in every single moment up to the age of 40. That includes social interactions: a conversation, the other person's expressions, what was said, and background knowledge about the other person. Even a single conversation's data is a huge amount of data.

But one might say that the brain is not lossless ... True, good point. But in what way is it lossy? Can that be simulated well enough to learn an Einstein? What gives events significance is very subjective.

reply
Kinda a moot point in my eyes because I very much doubt you can arrive at the same result without the same learning process.
reply
That's true. Though would that hippocampus-less Einstein be able to keep making novel, complex discoveries from that point forward? It seems difficult. He would rapidly reach the limits of his short-term memory (the same way current models rapidly reach the limits of their context windows).
reply
It could possibly be useful but I don't see why it would be AGI.
reply
Where does that training data come from?
reply
Who knows? Perhaps attention really is all you need. Maybe our context window is really large. Or our compression is really effective. Perhaps adding external factors might be able to indirectly teach the models to act more in line with social expectations such as being embarrassed to repeat the same mistake, unlocking the final piece of the puzzle. We are still stumbling in the dark for answers.
reply
Agents are capable of continual learning.
reply
Putting stuff you have learned into a markdown file is a very "shallow" version of continual learning. It can remember facts, yes, but I doubt a model can master new out-of-distribution tasks this way. If anything, I think that Google's Titans[1] and Hope[2] architectures are more aligned with true continual learning (without being actual continual learning still, which is why they call it "test-time memorization").

[1] https://arxiv.org/pdf/2501.00663

[2] https://arxiv.org/pdf/2512.24695

reply
I have had it master tasks by doing this. The first time it tries to solve an issue it may take a long time, but it documents its findings and how it was able to do it and then it applies that knowledge the next time the task comes up.
reply
The sum of human knowledge is more than enough to come up with innovative ideas, and not every field works directly with the physical world. Still, I would say there's enough information in written history to create a virtual simulation of a 3D world with all physical laws applying (to a certain degree, because computation is limited).

What current LLMs lack is an inner motivation to create something on their own without being prompted. To think in their free time (whatever that means for batch, on-demand processing), to reflect and learn, eventually to self-modify.

I have a simple brain, limited knowledge, limited attention span, limited context memory. Yet I create stuff based on what I see and read online. Nothing special; sometimes it's based more on someone else's project, sometimes on my own ideas, which I have no doubt aren't that unique among 8 billion other people. Yet consulting with AI provides me with more ideas applicable to my current vision of what I want to achieve. Sure, it's mostly based on generally known (though not always known to me) good practices. But my thoughts work the same way, only more limited by what I have slowly learned so far in my life.

reply
> a virtual simulation of a 3D world

Virtual simulations are not substitutable for the physical world. They are fundamentally different theory problems that have almost no overlap in applicability. You could in principle create a simulation with the same mathematical properties as the physical world but no one has ever done that. I'm not sure if we even know how.

Physical world dynamics are metastable and non-linear at every resolution. The models we do build are created from sparse irregular samples with large error rates; you often have to do complex inference to know if a piece of data even represents something real. All of this largely breaks the assumptions of our tidy sampling theorems in mathematics. The problem of physical world inference has been studied for a couple decades in the defense and mapping industries; we already have a pretty good understanding of why LLM-style AI is uniquely bad at inference in this domain, and it mostly comes down to the architectural inability to represent it.

Grounded estimates of the minimum quantity of training data required to build a reliable model of physical-world dynamics, given the above properties, run to many exabytes. This data exists, so that is not a problem. The models will be orders of magnitude larger than current LLMs. Even if you solve the computer science and theory problems around representation so that learning and inference are efficient, few people are prepared for the scale of it.

(source: many years doing frontier R&D on these problems)

reply
> You could in principle create a simulation with the same mathematical properties as the physical world but no one has ever done that. I'm not sure if we even know how.

What do you mean by that? Simulating physics is a rich field, which incidentally was one of the main drivers of parallel and supercomputing before AI came along.

reply
The mapping of the physical world onto a computer representation introduces idiosyncratic measurement issues for every data point. The idiosyncratic bias, errors, and non-repeatability change dynamically at every point in space and time, so they can be modeled neither globally nor statically. Some idiosyncratic bias exhibits coupling across space and time.

Reconstructing ground truth from these measurements, which is what you really want to train on, is a difficult open inference problem. The idiosyncratic effects induce large changes in the relationships learnable from the data model. Many measurements map to things that aren't real. How badly that non-reality can break your inference is context dependent. Because the samples are sparse and irregular, you have to constantly model the noise floor to make sure there is actually some signal in the synthesized "ground truth".

In simulated physics, there are no idiosyncratic measurement issues. Every data point is deterministic, repeatable, and well-behaved. There is also much less algorithmic information, so learning is simpler. It is a trivial problem by comparison. Using simulations to train physical world models is skipping over all the hard parts.

I've worked in HPC, including physics models. Taking a standard physics simulation and introducing representative idiosyncratic measurement error seems difficult. I don't think we've ever built a physics simulation with remotely the quantity and complexity of fine structure this would require.

reply
Is this like some scale-independent version of Heisenberg's Uncertainty Principle?
reply
I guess you need two things to make that happen. First, more specialization among models and an ability to evolve; otherwise you get all instances thinking roughly the same thing, or a deer-in-the-headlights situation where they don't know which of the millions of options they should think about. Second, fewer guardrails; there's only so much you can do by pure thought.

The problem is, idk if we're ready to have millions of distinct, evolving, self-executing models running wild without guardrails. It seems like a contradiction: you can't achieve true cognition from a machine while artificially restricting its boundaries, and you can't lift the boundaries without impacting safety.

reply
Agree. LLMs operate in the domain of language and symbols, but the universe contains much more than that. Humans also learn a great deal from direct phenomenological experience of the world, even without putting those experiences into words. I remember a talk by Yann LeCun where he pointed out that in just the first couple of years of life, a human baby is exposed to orders of magnitude more sensory data (vision, sound, etc.) than what current LLMs are typically trained on. This seems like a major limitation of purely language-based models.
reply
I'm gonna be a cynic and say this is money following money and Yann LeCun is an excellent salesman.

I 100% guarantee that he will not be holding the bag when this fails. Society will be protecting him.

On that proviso I have zero respect for this guy.

reply
Um, why would anyone be "holding the bag" and who needs protecting by society? He's not taking out a loan, he's getting capital investment in a startup. People are gambling that he will do well and make money for them. If they gamble wrong, that's on them. Society won't be doing anything either way because investors in startups that fail don't get anything.
reply
Really? As if everyone hasn't been telling him this for the last 10 years, especially Gary Marcus, whom he ridiculed on Twitter at every opportunity and whose position he now silently adopts, like a dog returning home. As if anyone was waiting for this; even 5 years ago this was old news, and Tenenbaum has been building world models for a long time. People in pop venture-capital culture don't seem to know what is going on in research. Makes them easier to milk.
reply
I have a pet peeve with the concept of "a genuinely novel discovery or invention". What do you imagine this to be? Can you point me towards a discovery or invention that was "genuinely novel", ever?

I don't think it makes sense conceptually unless you're literally referring to discovering new physical things like elements or something.

Humans are remixers of ideas. That's all we do all the time. Our thoughts and actions are dictated by our environment and memories; everything must necessarily be built up from pre-existing parts.

reply
W. Brian Arthur's book "The Nature of Technology" provides a framework for classifying new technology as elemental vs. innovative that I find helpful. For example, the Hunt-McIlroy diff operates on the phenomenon that ordered correspondence survives editing. That was an invention (the discovery of a natural phenomenon and a means to harness it). Myers diff improves performance by exploiting the fact that text changes are sparse. That's innovation. A Python app using libdiff: that's engineering. And then you might say in terms of "descendants": invention > innovation > engineering. But it's just a perspective.
reply
Suno is transformer-based; in a way it's a heavily modified LLM.

You can't get Suno to do anything that's not in its training data. It is physically incapable of inventing a new musical genre. No matter how detailed the instructions you give it, and even if you cheat and provide it with actual MP3 examples of what you want it to create, it is impossible.

The same goes for LLMs and invention generally, which is why they've made no important scientific discoveries.

You can learn a lot by playing with Suno.

reply
I don't see how this is an architectural problem though. The problem is that music datasets are highly multimodal, and the training process relies almost entirely on this dataset instead of incorporating basic musical knowledge that would allow it to explore a bit further. That's what happens when computer scientists aim to "upset" a field without consulting the experts in said field.
reply
Genuinely novel discovery or invention?

Einstein’s theory of relativity springs to mind, which is deeply counter-intuitive and relies on the interaction of forces unknowable to our basic Newtonian senses.

There’s an argument that it’s all turtles (someone told him about universes, he read about gravity, etc), but there are novel maths and novel types of math that arise around and for such theories which would indicate an objective positive expansion of understanding and concept volume.

reply
Einstein was heavily inspired by Mach: https://en.wikipedia.org/wiki/Mach%27s_principle
reply
Nah - Poincare & Lorentz did quite a bit of groundwork on relativity and its implications before Einstein put it all together.
reply
Novel things can be incremental. I don't think LLMs can do that either, at least I've never seen one do it.
reply
A few years ago I came up with this simple thought experiment to convince myself that LLMs won't achieve superhuman level (in the sense of being better than all human experts):

Imagine that we made an LLM out of all dolphin songs ever recorded: would such an LLM ever reach human-level intelligence? Obviously and intuitively, the answer is NO.

Your comment actually extended this observation for me, sparking hope that systems consuming the natural world as input might actually avoid this trap, but then I realized that tool use and learning may in fact be all that's needed for a singularity, while consuming raw data streams most of the time might actually be counterproductive.

reply
> Imagine that we made an LLM out of all dolphin songs ever recorded: would such an LLM ever reach human-level intelligence?

It could potentially reach super-dolphin level intelligence

reply
I mean no offense here, but I really don't like this attitude of "I thought for a bit and came up with something that debunks all of the experts!". It's the same stuff you see with climate denialism, but it seems to be considered okay when it comes to AI. As if the people who have spent all day, every day on this for decades have not thought of it.

Dataset limitations have been well understood since the dawn of statistics-based AI, which is why these models are trained on data and RL tasks that are as broad as possible, and are assessed by generalization performance. Most experts in ML, even the mathematically trained ones, have within the last few years acknowledged that superintelligence (under a more rigorous definition than the one here) is quite possible, even with only the current architectures. This is true even though no senior researcher in the field really wants superintelligence to be possible, hence the dozens of efforts to disprove its potential existence.

reply
Was AlphaGo's move 37 original?

In the last step of training LLMs, reinforcement learning from verifiable rewards, LLMs are trained to maximize the probability of solving problems using their own output, guided by a reward signal akin to winning in Go. It's not just imitating human-written text.
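
At its core that last step looks something like this toy sketch (made-up names, not any lab's actual pipeline): sample the model's own output, score it with a programmatic verifier, and push probability toward outputs that verify.

    import numpy as np

    rng = np.random.default_rng(0)
    logits = np.zeros(4)        # stand-in "model": a distribution over 4 candidate answers
    CORRECT = 2

    def verifier(answer):       # e.g. unit tests, a math checker, a win/loss signal in Go
        return 1.0 if answer == CORRECT else 0.0

    for step in range(2000):
        probs = np.exp(logits) / np.exp(logits).sum()
        answer = rng.choice(4, p=probs)                          # the model's own sampled output
        reward = verifier(answer)
        logits += 0.1 * reward * (np.eye(4)[answer] - probs)     # REINFORCE-style update

    print(np.round(np.exp(logits) / np.exp(logits).sum(), 3))    # mass concentrates on the verified answer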

Fwiw, I agree that world models and some kind of learning from interacting with physical reality, rather than massive amounts of digitized gym environments, are likely necessary for a breakthrough toward AGI.

reply
The term LLM is confusing your point because VLMs belong to the same bin according to Yann.

Using the term autoregressive models instead might help.

reply
Diffusion models are not autoregressive but have the same limitations
reply
Whether it is text or an image, it is just bits for a computer. A token can represent anything.
reply
Sure, but don't conflate the representation format with the structure of what's being represented.

Everything is bits to a computer, but text training data captures the flattened, after-the-fact residue of baseline human thought: Someone's written description of how something works. (At best!)

A world model would need to capture the underlying causal, spatial, and temporal structure of reality itself -- the thing itself, that which generates those descriptions.

You can tokenize an image just as easily as a sentence, sure, but a pile of images and text won't give you a relation between the system and the world. A world model, in theory, can. I mean, we ought to be sufficient proof of this, in a sense...

reply
It’s worth noting how our human relationship to, and understanding of, our world model changed as our tools to inspect and describe the world advanced.

So when we think about capturing any underlying structure of reality itself, we are constrained by the tools at hand.

The capability of the tool shapes the description, which in turn determines the level of understanding.

reply
Can a token represent concentration, will?
reply
Why can't LLMs (transformers trained on multimodal token sequences, potentially containing spatiotemporal information) be a world model?
reply
I really hate the world-model terminology, but LeCun's actual low-level gripe with autoregressive LLMs as they stand now is the fact that the loss function needs to reconstruct the entirety of the input. Anything less than pixel-perfect reconstruction on images is penalized. Token-by-token reconstruction is also biased towards that same level of granularity.

The density of information in the spatiotemporal world is very, very high, and a technique is needed to compress it down effectively. JEPAs are a promising technique in that direction, but if you're not reconstructing text or images, it's a bit harder for humans to immediately grok whether the model is learning anything effectively.

I think that very soon we will see JEPA-based language models, but their key domain may very well be robotics, where machines really need to experience and reason about the physical world differently from a purely text-based one.

reply
Isn't the Sora video model a ViT with spatiotemporal inputs (so they've found a way to compress that down), while at the same time LeCun wouldn't consider it a world model?
reply
Video-generation models have to have decoder output heads that reproduce pixel-level frames. The loss function involves producing plausible image frames, which requires a lot of detailed reconstruction.

I assume that when you get out of bed in the morning, the first thing you do is not paint 1000 1080p pictures of what your breakfast looks like.

LeCun's models predict purely in representation space and output no pixel-scale detailed frames. Instead, you train a model to generate a lower-dimensional representation of the same thing from different views, penalizing the model if the representations differ when it is looking at the same thing.
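
A crude numpy sketch of that idea (illustrative only, not LeCun's actual architecture): encode two views of the same scene into a low-dimensional representation and penalize disagreement there, with no decoder and no pixel-level target. Real systems need extra machinery (e.g. variance regularization or a predictor with stop-gradient) to keep the encoder from collapsing to a constant.

    import numpy as np

    rng = np.random.default_rng(0)
    W = rng.normal(size=(1024, 16)) * 0.01   # toy encoder: 1024 "pixels" -> 16-dim representation

    def encode(view):
        return view @ W

    def representation_loss(view_a, view_b):
        # Compare in representation space only; no frame is ever reconstructed.
        za, zb = encode(view_a), encode(view_b)
        return float(np.mean((za - zb) ** 2))

    scene = rng.normal(size=1024)
    view_a = scene + 0.05 * rng.normal(size=1024)   # two slightly different views of the same scene
    view_b = scene + 0.05 * rng.normal(size=1024)

    print(representation_loss(view_a, view_b))      # train W to make this small for matching views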

reply
https://medium.com/state-of-the-art-technology/world-models-...

> One major critique LeCun raises is that LLMs operate only in the realm of language, which is a simple, discrete space compared to the continuous, complex physical world we live in. LLMs can solve math problems or answer trivia because such tasks reduce to pattern completion on text, but they lack any meaningful grounding in physical reality. LeCun points out a striking paradox: we now have language models that can pass the bar exam, solve equations, and compute integrals, yet “where is our domestic robot? Where is a robot that’s as good as a cat in the physical world?” Even a house cat effortlessly navigates the 3D world and manipulates objects — abilities that current AI notably lacks. As LeCun observes, “We don’t think the tasks that a cat can accomplish are smart, but in fact, they are.”

reply
But they don't only operate on language? They operate on token sequences, which can be images, coordinates, time, language, etc.
reply
It’s an interesting observation, but I think you have it backwards. The examples you give are all using discrete symbols to represent something real and communicating this description to other entities. I would argue that all your examples are languages.
reply
What's the first L stand for? That's not just vestigial; their model of the world is formed almost exclusively from language, rather than from a range of things contributing significantly, as it is for humans.

The biggest thing that's missing is actual feedback on their decisions. They have no "idea" of that, because transformers and embeddings don't model it yet. And language descriptions and image representations of feedback aren't enough; they are too disjointed. It needs more.

reply
How is a linear stream of symbols able to capture the relationships of the real world?

It's like the people who are so hyped up about voice-controlled computers. A linear stream of symbols is a huge downgrade in signal, right? I don't want computer interaction to become even more simplified and worse.

Compare with domain experts who do real, complicated work with computers: animators, 3D modelers, CAD users, etc. Give them a mouse with six degrees of freedom, strong training in hotkeys to command actions and modes, and a good mental model of how everything works, and these people are dramatically more productive at manipulating data than anyone else.

Imagine trying to talk a computer through nudging a bunch of vertices through 3D space while flexibly managing modes of "drag" on connected vertices. It would be terrible. And no, you would not replace that with a sentence like "Bot, I want you to nudge out the elbow of that model", because that does NOT do the same thing at all. An expert fluidly making their idea real in real time is not even remotely close to the "project manager/mediocre implementer" relationship you get instead when prompting any sort of generative model. The models aren't even built to contain a specific "style", so they certainly won't be opinionated enough to have artistic vision, a strong understanding of what does and does not work in the right context, or a sense of how to navigate "my boss wants something stupid that doesn't work, and he's a dumb person, so how do I convince him to drop the dumb idea and make him think that was his idea?"

reply
> We don’t think the tasks that a cat can accomplish are smart, but in fact, they are.

https://en.wikipedia.org/wiki/Moravec%27s_paradox

All the things we look at as "Smart" seem to be the things we struggle with, not what is objectively difficult, if that can even be defined.

reply
There will be no "unlocking of AGI" until we develop a new science capable of artificial comprehension. Comprehension is the cornucopia that produces everything we are: given raw stimulus, an entire communicating Universe is generated, with a plethora of highly advanced predator/prey characters in an infinitely complex dynamic, and human science and technology have no lead on how to artificially make sense of that as a simultaneous unifying whole. That's comprehension.
reply
Ironically, your comment is practically incomprehensible.
reply
These two comments above me capture Slashdot in the early 2000s.
reply
A lot more justifiable than, say, Thinking Machines at least. But we will "see".

World models and vision seem like a great use case for robotics, which I can imagine being the main driver of AMI.

reply
> LLMs are fundamentally capped because they only learn from static text -- human communications about the world -- rather than from the world itself, which is why they can remix existing ideas but find it all but impossible to produce genuinely novel discoveries or inventions.

No hate, but this is just your opinion.

The definition of "text" here is extremely broad – an SVG is text, but it's also an image format. It's not incomprehensible to imagine how an AI model trained on lots of SVG "text" might build internal models to help it "visualise" SVGs in the same way you might visualise objects in your mind when you read a description of them.

The human brain only has electrical signals for IO, yet we can learn and reason about the world just fine. I don't see why the same wouldn't be possible with textual IO.

reply
Yeah, I don't even think you'd need to train it. You could probably just explain how SVG works (or just tell it to emit coordinates of lines it wants to draw), tell it to draw a horse, and I have to imagine it would be able to do so, even if it had never been trained on images, SVG, or even Cartesian coordinates. I think there's enough world model in there that you could simply explain Cartesian coordinates in the context; it'd figure out how those map to its understanding of a horse's composition and output something roughly correct. It'd be an interesting experiment anyway.

But yeah, I can't imagine that LLMs don't already have a world model in there. They have to. The internet's corpus of text may not contain enough detail to allow an LLM to differentiate between similar-looking celebrities, but it contains plenty of information to allow it to build a world model of how we perceive the world. And it's a vastly more information-dense means of doing so.

reply
Honestly, how do people who know so little have this much confidence to post here?
reply
Care to explain what led to this reaction?
reply
You must be new here
reply