It'll require some kind of:
- breakthrough in architecture or
- breakthrough in hardware or
- breakthrough in quantization techniques
The problem is that all the parameters need to be in memory, even the ones that aren't active (as in Mixture-of-Experts models), because swapping parameters in and out of RAM is far too slow.
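A quick back-of-envelope calculation shows why swapping doesn't work. All numbers here are illustrative assumptions (a hypothetical MoE with 100M-parameter experts, typical SSD and GPU memory bandwidths), not measurements of any real system:

```python
# Back-of-envelope: why swapping MoE experts in and out of memory is too slow.
# All figures below are illustrative assumptions.

PARAMS_PER_EXPERT = 100e6   # assume 100M parameters per expert
ACTIVE_EXPERTS = 8          # experts routed to per token
BYTES_PER_PARAM = 2         # fp16

bytes_per_token = PARAMS_PER_EXPERT * ACTIVE_EXPERTS * BYTES_PER_PARAM  # 1.6 GB

NVME_BW = 7e9    # ~7 GB/s, a fast NVMe SSD
HBM_BW = 2e12    # ~2 TB/s, modern GPU memory

print(f"load from SSD: {bytes_per_token / NVME_BW:.2f} s per token")
print(f"read from HBM: {bytes_per_token / HBM_BW * 1e3:.2f} ms per token")
```

Under these assumptions, streaming the active experts from disk costs roughly a quarter of a second per token, while reading them from resident GPU memory costs under a millisecond. That three-orders-of-magnitude gap is why every parameter has to stay in fast memory.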
"We show that EMO – a 1B-active, 14B-total-parameter (8-expert active, 128-expert total) MoE trained on 1 trillion tokens – supports selective expert use: for a given task or domain, we can use only a small subset of experts (just 12.5% of total experts) while retaining near full-model performance."
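The quoted numbers let us estimate how small a domain-specialized deployment could be. This is rough arithmetic, and it assumes the 1B "active" figure includes shared (non-expert) parameters; the quote doesn't give the actual split, so treat the results as estimates:

```python
# Rough arithmetic from the quoted EMO numbers. Assumes the 1B "active"
# count includes shared (non-expert) parameters -- the true split isn't
# stated, so these are estimates.

TOTAL = 14e9        # total parameters
ACTIVE = 1e9        # active parameters per token
N_EXPERTS = 128
ACTIVE_EXPERTS = 8

# Two equations, two unknowns:
#   TOTAL  = shared + N_EXPERTS      * per_expert
#   ACTIVE = shared + ACTIVE_EXPERTS * per_expert
per_expert = (TOTAL - ACTIVE) / (N_EXPERTS - ACTIVE_EXPERTS)   # ~108M
shared = ACTIVE - ACTIVE_EXPERTS * per_expert                  # ~133M

# "12.5% of total experts" = 16 of the 128 experts kept for one domain
kept = shared + 16 * per_expert
print(f"per expert: {per_expert/1e6:.0f}M, shared: {shared/1e6:.0f}M")
print(f"subset model: {kept/1e9:.2f}B params ({kept/TOTAL:.0%} of total)")
```

If the claim holds, a per-domain deployment would need memory for roughly 1.9B parameters instead of 14B, which is exactly the kind of win selective expert use is after.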
A crow exhibits some degree of intelligence with a brain that is tiny compared to a human's. There is overlap between the problem-solving skills of the dumbest humans and the smartest crows.
So the question is: what is that? Yann LeCun seems to think it's what we now call world models. World models predict behaviour, as opposed to predicting structured data (like language).
If your model can predict how some world works (how you define world largely depends on the size of your training data), then in theory it is able to reason about cause and effect.
If you can combine cause and effect reasoning with language, you might get something truly intelligent.
That’s where things seem to be going. Once we have a prototype of that system, there will be many questions about how much data you really need. We’ve seen how even shrinking LLMs with 1-bit quantization can lead to models that exhibit a fairly strong understanding of language.
I don't think it's unreasonable to expect some very intelligent, relatively low-memory AI systems in the next couple of years.