Obviously, there's a limit to how much information you can squeeze into a single parameter. I'd guess the low-hanging fruit will be picked soon, and scaling will continue through algorithmic improvements in training, like [1], to keep the training compute feasible.

I take "you can't have human-level intelligence without roughly the same number of parameters (hundreds of trillions)" as a null hypothesis: true until proven otherwise.

[1] https://arxiv.org/html/2602.15322v1

reply
Why don't we need them? If I need to run a hundred small models to get a given level of quality, what's the difference to me between that and running one large model?
reply
You can run smaller models on smaller compute hardware and split the work across machines. For a large model, you need to fit the whole thing in memory to get any decent throughput.
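Back-of-the-envelope sketch, assuming fp16 weights (2 bytes/parameter) and ignoring KV cache and activations, which real serving also needs:

    # Rough serving-memory estimate, weights only.
    # Numbers are illustrative assumptions, not measurements.
    BYTES_PER_PARAM_FP16 = 2

    def weights_gb(n_params: float) -> float:
        """GB needed just to hold the weights."""
        return n_params * BYTES_PER_PARAM_FP16 / 1e9

    print(f"70B dense model: {weights_gb(70e9):.0f} GB")  # ~140 GB -> multi-GPU node
    print(f"1B small model:  {weights_gb(1e9):.0f} GB")   # ~2 GB -> fits one cheap GPU

A hundred 1B models need roughly the same total memory as one 100B model, but each piece can live on its own commodity box instead of one giant interconnected node.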
reply
Ah interesting, I didn't realize an MoE's experts don't all need to run in the same place.
reply
> If I need to run a hundred

It's unfair to pick such a high number: it either just signals disagreement, or assumes that parameter-count parity is meaningful in the first place.

> level of quality

What is quality, though? What counts as high quality? Do MY FELLOW HUMANS really know what "quality" consists of? Do I hear someone yell "QUALITY IS SUBJECTIVE" from the cheap seats?

I'll explain.

You might care about accuracy (repeating learned/given text) more than about actual cognitive ability (the classic clothesline puzzle: if one shirt takes an hour to dry on the line, how long do 12 shirts take?).

From my perspective, the ability to repeat given/learned text has nothing to do with "high quality". Any idiot can do that.

Here's a simple example:

Stupid doctors exist. Plentifully so, even. Every doctor can pattern-match symptoms to medication or further tests, but not every doctor is capable of recognizing when two seemingly different symptoms are actually connected. (simple example: a stiff neck caused by sinus issues)

There is not one person on the planet who wouldn't prefer a doctor who deeply considers the complexities and feedback loops of the human body over a doctor who is simply not smart enough to do so. A doctor can memorize texts all he wants, but memorizing text does not require deeper understanding.

There are plenty of benefits to running multiple models in parallel. A big one is specialization and caching. Another is context expansion: context expansion is what "reasoning" models can be observed doing when they support themselves with their own feedback loop.

One does not need a hundred small models to achieve whatever you might consider worthy of being called "quality". These models can not only reason independently of each other, but also interact contextually, expanding each other's contexts around what actually matters.

They also don't need to learn all the information about "everything", like big models do. It's simply not necessary anymore. We have very capable systems for retrieving information and feeding it to models with gigantic context windows, if needed. We can build purpose-built models. Density per parameter keeps increasing.
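As a sketch of the retrieve-instead-of-memorize idea (search and generate are hypothetical stand-ins for a retriever and a small model's inference call, not any real API):

    # Small model + retrieval instead of a big model that memorized everything.
    def search(query: str, k: int = 5) -> list[str]:
        """Hypothetical retriever (e.g. a vector store or BM25 index)."""
        raise NotImplementedError

    def generate(prompt: str) -> str:
        """Hypothetical call into a small, reasoning-focused model."""
        raise NotImplementedError

    def answer(question: str) -> str:
        context = "\n\n".join(search(question))
        # The model needn't have memorized the facts; it only needs the
        # cognitive ability to reason over what it's handed.
        return generate(f"Context:\n{context}\n\nQuestion: {question}")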

Multiple small models, specifically trained for strong reasoning/cognitive capabilities and given access to relevant texts, can develop multiple perspectives on a matter in parallel, boosting context expansion massively.

A single model cannot refactor its own chain of thought during an inference run, which is massively inefficient. A single model can only feed itself feedback one step at a time, while multiple models can do it in parallel.
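A toy sketch of that difference (model_call is again a hypothetical inference function, and the specialist roles are made up for illustration):

    # Sequential self-feedback vs. parallel multi-model feedback.
    from concurrent.futures import ThreadPoolExecutor

    ROLES = ["diagnostician", "pharmacologist", "anatomist"]  # made-up specializations

    def sequential_self_feedback(model_call, prompt, rounds=3):
        # One model critiquing itself: every round waits on the previous one.
        draft = model_call(prompt)
        for _ in range(rounds):
            draft = model_call(f"Critique and improve:\n{draft}")
        return draft

    def parallel_multi_model(model_call, prompt):
        # Independent specialists respond at once; one final pass merges
        # their feedback, so context expansion happens in parallel.
        with ThreadPoolExecutor() as pool:
            views = list(pool.map(
                lambda role: model_call(f"As a {role}, analyze:\n{prompt}"),
                ROLES))
        return model_call("Synthesize these analyses:\n" + "\n---\n".join(views))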

See ... there are two things that cover the above fundamentally:

1. No matter how you put it, we've learned that models are "smarter" when there is at least one feedback loop involved.

2. No matter how you put it, you can always have yet another model process the output of a previously run model.

These two things, in combination, strongly indicate that the way to go is multiple small, high-efficiency models running in parallel, providing each other with the independent feedback they need to actually expand contexts in depth.

Or, in other words:

Big models scale Parameters; many small models scale Insight.

reply
> There is not one person on the planet who wouldn't prefer a doctor who deeply considers the complexities and feedback loops of the human body over a doctor who is simply not smart enough to do so. A doctor can memorize texts all he wants, but memorizing text does not require deeper understanding.

But a smart person who hasn’t read all the texts won’t be a good doctor, either.

Chess players spend enormous amounts of time studying openings for a reason.

> Multiple small models, specifically trained for strong reasoning/cognitive capabilities and given access to relevant texts

So, even assuming that one can train a model on reasoning/cognitive abilities, how does one pick the relevant texts for a desired outcome?

reply
I'd suggest that a measure like "density per parameter", as you put it, will asymptotically approach a hard theoretical limit (one that probably isn't much higher than what we have already). So quite unlike Moore's Law.
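To illustrate the shape of that claim (all numbers made up; only the curves matter):

    # Toy curves: saturating "density per parameter" vs. Moore's-Law growth.
    import math

    CEILING = 1.0  # hypothetical hard theoretical limit

    def density(t, rate=0.5):
        return CEILING * (1 - math.exp(-rate * t))  # asymptotes to CEILING

    def moore(t):
        return 2 ** (t / 2)  # doubles every 2 years, no ceiling

    for year in range(0, 11, 2):
        print(f"year {year:2d}: density={density(year):.3f}  moore={moore(year):6.1f}")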
reply
deleted
reply
Doesn't the widely accepted Bitter Lesson say the exact opposite about specialized vs. generalized models?
reply
The corollary to the Bitter Lesson is that on any market-meaningful time scale, a human-crafted solution will outperform one that relies on compute and data. It's only on time scales over five years that your bespoke solution will be overtaken. By that point you can hand-craft a new system that uses the brute-force model as part of it.

Repeat ad nauseam.

I wish the people who quote the blog post actually read it.

reply
The Bitter Lesson is about exploration and learning from experience, i.e., RL (Sutton's own field) and meta-learning. Specialized models are fine from a Bitter Lesson standpoint if the specialization mixture is meta-learned / searched / dynamically learned and routed.
reply