That's already happening. Qwen3.6 and Gemma4.

Basically small and medium models that are crazy well trained for their sizes.

Then we have a lot of speculative decoding work, like MTP and others, coming to speed up responses, and finally better quantisation to use less memory.

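For anyone unfamiliar, here is a rough sketch of the speculative-decoding idea, with toy stand-in models and a simplified acceptance rule (real implementations accept or reject based on the two models' probabilities; none of the function names below come from any real API):

```python
import random

# Toy speculative decoding: a cheap "draft" model proposes a few tokens,
# the expensive "target" model verifies them, and only rejected tokens fall
# back to the slow path. All "models" here are random stand-ins.
random.seed(0)
VOCAB = list(range(100))

def draft_model(context):
    """Cheap model: proposes the next token quickly (toy: random choice)."""
    return random.choice(VOCAB)

def target_accepts(context, token):
    """Expensive model verifies a drafted token (toy: accept ~80% of the time)."""
    return random.random() < 0.8

def target_model(context):
    """Expensive model samples its own token when a draft is rejected."""
    return random.choice(VOCAB)

def speculative_decode(prompt, n_tokens, k=4):
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        # 1) Draft k candidate tokens cheaply.
        proposals = [draft_model(out) for _ in range(k)]
        # 2) Verify them with the target model; keep the accepted prefix.
        for tok in proposals:
            if target_accepts(out, tok):
                out.append(tok)
            else:
                out.append(target_model(out))  # fall back to the target's own sample
                break
    return out[len(prompt):len(prompt) + n_tokens]

print(speculative_decode([1, 2, 3], n_tokens=10))
```
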
Local LLMs are the future, and the larger labs know that the open models will eat their lunch once people realise that the gap is only a few months. If the closed LLMs of a couple of months ago were good enough for us, then today's open models are good enough now.

reply
And how were those models developed and trained?
reply
> And how were those models developed and trained?

That's irrelevant to my decision to use local or not.

reply
That's not what this thread is about? We're saying some new breakthrough is needed, someone said it already has happened, and I'm asking if it really has. Has it? I don't think so; those models are not fundamentally different from other LLMs.
reply
> We're saying some new breakthrough is needed, someone said it already has happened, and I'm asking if it really has.

I didn't read "and how were those models trained" as "Are we there yet?"

reply
There's a percentage of people who love to question how the open models were trained. They are almost always going to try and make some argument that using the closed frontier models for distillation is some form of theft.

Just totally forgetting that the frontier models themselves stole an insane amount to get to where they are.

It's theft all the way across the board, and when someone tries to argue that the open models' theft is bad but Altman's or Amodei's theft is good, they are revealing a lot about themselves.

reply
The current LLMs are also "magic", so anything is possible. AFAIK there is no proof that the current architecture is optimal. And we have our brains as a pretty powerful local thinking machine as a counter-example to the idea that thinking has to happen in data centers.
reply
I want to ask what makes them magic, but even those building LLMs don't really know what happens when they run inference...

I have to assume current architectures aren't optimal though; the idea that we stumbled into the one and only optimal solution seems almost impossible.

reply
I mean, the most cutting edge of iPhones, iPads and MacBook Pros _today_ are quite capable of running today's high-end local LLMs in realtime.

If you project out that hardware just a couple of years, and the trained models out a couple of years, you end up in a place where it makes so much more sense to run them locally, for all sorts of latency, privacy, efficacy, and domain-specific reasons.

Not all that different from the old terminal & mainframe -> PC shifts.

Finally, hardware has seemingly gotten out ahead of the software most folks use: watching YouTube, listening to music, playing a game or two. There was a time when playing an MP3 or watching a 4K video really taxed all but the nicest systems. Hardware fixed that problem, and it very well could fix this one.

reply
> I mean, the most cutting edge of iPhones, iPads and MacBook Pros _today_ are quite capable of running today's high-end local LLMs in realtime

Definitely not the high-end local LLMs. The small ones, yes, absolutely.

> If you project out that hardware just a couple of years

One of the biggest bottlenecks for LLMs is memory capacity and bandwidth. With the current memory crunch, it's unlikely we'll see big advancements in the average memory available, or its bandwidth, on regular (not super-high-end) devices in the coming years.

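To put rough numbers on the bandwidth point: single-stream decoding has to stream the whole model from memory for every generated token, so bandwidth divided by model size gives an upper bound on tokens per second. The figures below are illustrative assumptions, not measurements of any specific device:

```python
# Back-of-the-envelope decode-speed bound: tokens/sec <= bandwidth / model size.
# All numbers are illustrative assumptions, not benchmarks.
def max_tokens_per_sec(params_billions, bytes_per_param, bandwidth_gb_s):
    model_gb = params_billions * bytes_per_param   # resident model size in GB
    return bandwidth_gb_s / model_gb

# A 7B model at ~4-bit (0.5 bytes/param) on ~100 GB/s of phone-class bandwidth:
print(max_tokens_per_sec(7, 0.5, 100))   # ~28 tokens/s, best case
# The same model at fp16 (2 bytes/param) on the same bandwidth:
print(max_tokens_per_sec(7, 2.0, 100))   # ~7 tokens/s, best case
```

That's also why better quantisation helps twice over: it shrinks the footprint and raises the achievable tokens/sec on the same bandwidth.
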
Alternatively, it's possible we get dedicated SLMs (small language models) for e.g. phone-specific use cases that are optimised and run well.

reply