Right now it feels like hammering a house onto a nail instead of the other way around.
LLMs have something not entirely unlike the "g factor" in humans - a broad "capability base" that spans domains. The best of the best "coding LLMs" need both good "in-domain training" for coding specifically and a high "capability base". And much of that "base" comes from model size and the scale of data and compute used in pre-training.
Reducing the model scale and pruning the training data would result in a model with a lower "base". It would also hurt in-domain performance - because capabilities generalize and transfer, and pruning C code from the training data would "unteach" the model things that also apply to code in PHP.
Thus, the pursuit of "narrow specialist LLMs" is misguided, as a rule.
Unless you have a well-defined bar that, once cleared, makes the task solved (with no risk of scope creep, no benefit from any future capability improvements above that bar, and enough load to justify the engineering cost of training a purpose-specific model), a "strong generalist" LLM is typically a better bet than a "narrow specialist".
In practice, this is an incredibly rare set of conditions to be met.
There are hardware-based limits on the size of LLMs you can feasibly train and serve, which impose a limit on the amount of information you can pack into a single model's weights, and on the amount of compute per second you can get out of that model at inference time.
My company has been working on this specifically, because even now most researchers don't seem to grasp that this is just as much an economics and knowledge problem (cf. Hayek) as it is an "intelligence" problem.
It is much more efficient to strategically delegate specialized tasks, or ones that require a lot of tokens but not much intelligence, to models that can be served more cheaply. This is one of the things Claude Code does very well. It's also the basis for MoE and some similar architectures, with a smarter router model serving as a common base for the experts.
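The delegation idea can be sketched in a few lines: route token-heavy but low-difficulty work to a cheap model, and reserve the expensive frontier model for tasks that need real reasoning. This is a minimal illustration, not any real API; the model names, prices, and the difficulty cutoff are all made-up placeholders.

```python
# Hypothetical cost-aware delegation sketch. Nothing here is a real
# provider API; names and per-1k-token prices are illustrative only.
from dataclasses import dataclass

@dataclass
class Task:
    estimated_tokens: int   # rough token budget the task will consume
    difficulty: float       # 0.0 (mechanical) .. 1.0 (frontier-level reasoning)

# (model name, illustrative price per 1k tokens)
CHEAP_MODEL = ("small-model", 0.1)
FRONTIER_MODEL = ("frontier-model", 3.0)

def route(task: Task, difficulty_cutoff: float = 0.6) -> str:
    """Send low-difficulty work to the cheap model, the rest to the frontier model."""
    name, _ = CHEAP_MODEL if task.difficulty < difficulty_cutoff else FRONTIER_MODEL
    return name

def cost(task: Task) -> float:
    """Estimated cost of the task under the routing policy above."""
    _, price = CHEAP_MODEL if route(task) == CHEAP_MODEL[0] else FRONTIER_MODEL
    return task.estimated_tokens / 1000 * price

# A bulk mechanical edit burns many tokens but needs little intelligence,
# so it goes to the cheap model; a subtle bug goes to the frontier model.
bulk_edit = Task(estimated_tokens=50_000, difficulty=0.2)
hard_bug = Task(estimated_tokens=2_000, difficulty=0.9)
print(route(bulk_edit), cost(bulk_edit))
print(route(hard_bug), cost(hard_bug))
```

The point of the sketch is the economics: the router itself needs almost no intelligence, yet it keeps the expensive model's capacity for the work that actually requires it.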
...with a fair amount of supervision, while frontier models would be running circles around them using project-specific memory and on-demand training (or whatever we would have by then).
If you're building something groundbreaking and new, the advantage will be slim to none.