640GB ought to be enough for anybody
Squeezing a model like this, complete with 'big model smell', into 16GB... Honestly, it isn't possible or even feasible today.

It'll require some kind of:

- breakthrough in architecture or

- breakthrough in hardware or

- some breakthrough quantization technique

The problem is that all the parameters need to be in memory, even the ones that aren't active (say, in Mixture-of-Experts models), because swapping parameters in and out of RAM is far too slow.
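A rough back-of-the-envelope makes the point. The figures below (14B total parameters, various precisions) are illustrative, not a claim about any particular model:

```python
def model_memory_gb(total_params: float, bytes_per_param: float) -> float:
    """Approximate memory needed to hold all weights resident."""
    return total_params * bytes_per_param / 1e9

# A hypothetical 14B-total-parameter MoE: even if only ~1B parameters
# are active per token, every expert's weights must stay in memory.
print(model_memory_gb(14e9, 2))    # fp16: 28 GB
print(model_memory_gb(14e9, 0.5))  # 4-bit quantized: 7 GB
```

Even aggressively quantized, the *total* parameter count is what sets the memory floor, which is why MoE sparsity alone doesn't get you into 16GB.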

"That’s where EMO comes in.

We show that EMO – a 1B-active, 14B-total-parameter (8-expert active, 128-expert total) MoE trained on 1 trillion tokens – supports selective expert use: for a given task or domain, we can use only a small subset of experts (just 12.5% of total experts) while retaining near full-model performance."

https://allenai.org/blog/emo
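The selective-expert idea can be sketched as routing only among a whitelisted subset of experts, so only that subset's weights need to be resident. This is a toy illustration with made-up shapes and a random router, not EMO's actual code:

```python
import numpy as np

rng = np.random.default_rng(0)
n_experts, d = 128, 16
router = rng.standard_normal((d, n_experts))  # toy router weights

def route(x, allowed, k=8):
    """Pick the top-k experts, but only among an allowed subset."""
    logits = x @ router
    masked = np.full(n_experts, -np.inf)
    masked[allowed] = logits[allowed]       # disallowed experts can never win
    return np.argsort(masked)[-k:][::-1]

# Keep just 16 of 128 experts resident (12.5%, matching the quote).
allowed = list(range(16))
x = rng.standard_normal(d)
chosen = route(x, allowed)
print(chosen)  # every chosen expert lies in the allowed subset
```

The interesting empirical claim is that restricting routing like this costs little accuracy for a fixed task or domain, which is exactly what would make low-memory deployment practical.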

The people working at the leading edge of this stuff seem to believe that there is a need for parallel models that solve different problems.

A crow exhibits some degree of intelligence in what is a very small brain compared to humans. There is overlap in the problem solving skills of the dumbest humans and the smartest crows.

So the question is: what is that? Yann LeCun seems to think it’s what we now call world models. World models predict behaviour, as opposed to predicting structured data (like language).

If your model can predict how some world works (how you define world largely depends on the size of your training data), then in theory it is able to reason about cause and effect.

If you can combine cause and effect reasoning with language, you might get something truly intelligent.

That’s where things seem to be going. Once we have a prototype of that system, there will be many questions about how much data you really need. We’ve seen how even shrinking LLMs with 1-bit quantization can lead to models that exhibit a fairly strong understanding of language.

I don’t think it’s unreasonable to expect some very intelligent, relatively low-memory AI systems in the next couple of years.
