> solves the problem of too much demand for inference
False, it creates consumer demand for inference chips, which will be badly utilised.
> also would use less electricity
What makes you think that? (MAYBE you can save power on cooling. But not if the data center is close to a natural heat sink)
> It's just a matter of getting the performance good enough.
The performance limitations are inherent to the limited compute and memory.
> Most users don't need frontier model performance.
What makes you think that?
I think the opposite is true. Local inference doesn't have to go over the wire and through a bunch of firewalls and what have you. The performance from just regular consumer hardware with local, smaller models is already decent. You're utilizing the hardware you already have.
> The performance limitations are inherent to the limited compute and memory.
When you plug in a local LLM and inference engine into an agent that is built around the assumption of using a cloud/frontier model then that's true.
But agents can be built around local assumptions and more specific workflows and problems. That also includes the model orchestration and model choice per task (or even tool).
The Jevons Paradox comes into play with using cloud models. But when you have fewer resources, you are forced to move into more deterministic workflows. That includes tighter control over what the agent can do at any point in time, but also per-project/session workflows where you generate intermediate programs/scripts instead of letting the agent just do whatever it wants.
Let me give you an example:
When you ask a cloud based agent to do something and it wants more information, it will often do a series of tool calls to gather what it thinks it needs before proceeding. Very often you can front load that part, by first writing a testable program that gathers most of the necessary information up front and only then moving into an agentic workflow.
This approach can produce a bunch of .json, .md files or it can move things into a structured database or you can use embeddings or what have you.
This can save you a lot of inference, make things more reusable and you don't need a model that is as capable if its context is already available and tailored to a specific task.
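A minimal sketch of that "gather first, then go agentic" idea: collect project facts once, up front, into a `context.json` the agent can read, instead of letting it burn tokens rediscovering them with ad-hoc tool calls. The file layout and field names here are illustrative assumptions, not anyone's actual pipeline.

```python
import json
import pathlib
import subprocess

def gather_context(repo_root="."):
    """Front-load context gathering: write the facts an agent would
    otherwise discover through a series of tool calls into one file."""
    root = pathlib.Path(repo_root)
    context = {
        # Hypothetical choices of what to front-load; tailor per project
        "files": sorted(str(p) for p in root.rglob("*.py")),
        "readme": (root / "README.md").read_text()
                  if (root / "README.md").exists() else "",
    }
    # Git metadata, if available; skip silently when not a repo
    try:
        context["branch"] = subprocess.run(
            ["git", "rev-parse", "--abbrev-ref", "HEAD"],
            capture_output=True, text=True, check=True, cwd=root,
        ).stdout.strip()
    except (subprocess.CalledProcessError, FileNotFoundError):
        context["branch"] = None
    (root / "context.json").write_text(json.dumps(context, indent=2))
    return context
```

Because the gathering step is a plain, testable program, you can rerun and reuse it across sessions without paying for any inference.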
Meaning: these 5000 tokens consume a tiny amount of energy being moved from the data center to your PC, but an enormous amount of energy being generated in the first place. An equivalent webpage with the same amount of text would be perceived as instant in any network configuration. Just some kilobytes of text, much smaller than most background graphics. The two things can't be compared at all.
However, just last week there have been huge improvements on the hardware required to run some particular models, thanks to some very clever quantisation. This lowers the memory required 6x in our home hardware, which is great.
In the end, we spent more energy playing video games over the last two decades than on all this AI craze, and it was never a problem. We can surely run models locally, and heat our homes in winter.
There are so many CPUs, GPUs, RAM sticks and SSDs which are underutilized. I have some in my closet doing 5% load at peak times. Why would inference chips be special once they become commodity hardware?
The fact that today's and yesterday's models are quite capable of handling mundane tasks, and that even the companies behind frontier models are investing heavily in strategies to manage context instead of blindly plowing through problems with brute-force generalist models.
But let's flip this around: what on earth even suggests to you that most users need frontier models?
Having access to a model that is drawing from good sources and takes time to think instead of hallucinating a response is important in many domains of life.
Looking at actual users of LLMs
There are some surprisingly decent models that happily fit even into a mere 16 gigs of RAM. The recent Qwen 3.5 9B model is pretty good, though it did trip all over itself to avoid telling me what happened on Tiananmen Square in 1989. (But then I tried something called "Qwen3.5-9B-Uncensored-HauhauCS-Aggressive", which veers so hard the other way that it will happily write up a detailed plan for your upcoming invasion of Belgium, so I guess it all balances out?)
Worth adding that I had reasoning on for the Tiananmen question, so I could see the prep for the answer, and it had a pretty strong current of "This is a sensitive question to PRC authorities and I must not answer, or even hint at an answer". I'm not sure if a research tool would be sufficient to overcome that censorship, though I guess I'll find out!
Getting the local weather using a free API like met.no is a good first tool to use.
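A sketch of such a tool, assuming met.no's Locationforecast 2.0 response layout (the service requires an identifying `User-Agent` header; the contact string and function names below are placeholders):

```python
import json
import urllib.request

MET_URL = ("https://api.met.no/weatherapi/locationforecast/2.0/"
           "compact?lat={lat}&lon={lon}")

def current_temperature(forecast):
    """Distill the full forecast JSON down to the one number the model needs."""
    first = forecast["properties"]["timeseries"][0]
    return first["data"]["instant"]["details"]["air_temperature"]

def get_weather(lat, lon):
    # met.no rejects requests without an identifying User-Agent
    req = urllib.request.Request(
        MET_URL.format(lat=lat, lon=lon),
        headers={"User-Agent": "local-llm-weather-tool/0.1 you@example.com"},
    )
    with urllib.request.urlopen(req) as resp:
        return current_temperature(json.load(resp))
```

The point of the `current_temperature` step is that the local model only ever sees a single number, not a multi-kilobyte JSON blob it has to wade through.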
It needs to be just smart enough to use the tools and distill the responses into something usable. And one of the tools can be "ask claude/codex/gemini" so the local model itself doesn't actually need to do much.
That doesn't fix the "you don't know what you don't know" problem which is huge with smaller models. A bigger model with more world knowledge really is a lot smarter in practice, though at a huge cost in efficiency.
Is there already some research or experimentation done into this area?
Picking a model that's juuust smart enough to know it doesn't know is the key.
No. It runs on MacOS but uses Metal instead of MLX.
MLX is faster because it has better integration with Apple hardware. On the other hand GGUF is a far more popular format so there will be more programs and model variety.
So it's kinda like having a very specific diet that you swear is better for you, but you can only order food from a few restaurants.
Recently I built a graphRAG app with Qwen 3.5 4b for small tasks like classifying what type of question I am asking or the entity extraction process itself, as graphRAG depends on extracted triplets (entity1, relationship_to, entity2). I used Qwen 3.5 27b for actually answering my questions.
It works pretty well. I have to be a bit patient but that’s it. So in that particular use case, I would agree.
I used MLX and my M1 64GB device. I found that MLX definitely works faster when it comes to extracting entities and triplets in batches.
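The commenter's actual pipeline isn't shown, but the extraction step can be sketched like this, with the small-model call itself left out and only the prompt plus a parser for the `(entity1, relationship_to, entity2)` lines it should emit (the prompt wording and regex are assumptions):

```python
import re

# Prompt handed to the small model (e.g. a 4b-class model) per chunk of text
TRIPLET_PROMPT = """Extract knowledge triplets from the text below.
Output one per line as: (entity1, relationship_to, entity2)

Text:
{text}
"""

TRIPLET_RE = re.compile(r"\(([^,]+),\s*([^,]+),\s*([^)]+)\)")

def parse_triplets(model_output):
    """Turn the small model's raw output into (head, relation, tail) tuples."""
    return [tuple(part.strip() for part in m.groups())
            for m in TRIPLET_RE.finditer(model_output)]
```

The parsed tuples then go into the graph store, and only question answering is routed to the bigger model.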
----
Full reaction:
Yes but perhaps not in a way you might expect. Qwen's reasoning ability isn't exactly groundbreaking. But it's good enough to weave a story, provided it has some solid facts or notes. GraphRAG is definitely a good way to get some good facts, provided your notes are valuable to you and/or contain some good facts.
So the added value is that you now have a super charged information retrieval system on your notes with an LLM that can stitch loose facts reasonably well together, like a librarian would. It's also very easy to see hallucinations, if you recognize your own writing well, which I do.
The second thing is that I have a hard time rereading all my notes. I write a lot of them and don't have the time to reread any, so oftentimes I forget my own advice. Now that I have a super charged information retrieval system on my notes, whenever I ask a question, the graphRAG + LLM searches for the most relevant notes related to it. I've found that 20% of what I wrote is incredibly useful and is stuff that I forgot.
And there are nuggets of wisdom in there that are quite nuanced. For me specifically, I've seen insights in how I relate to work that I should do more with. I'll probably forget most things again but I can reuse my system and at some point I'll remember what I actually need to remember. For example, one thing I read was that work doesn't feel like work for me if I get to dive in, zoom out, dive in, zoom out. Because in the way I work as a person: that means I'm always resting and always have energy for the task that I'm doing. Another thing that it got me to do was to reboot a small meditation practice by using implementation intentions (e.g. "if I wake up then I meditate for at least a brief amount of time").
What also helps is to have a bit of a back and forth with your notes and then copy/paste the whole conversation in Claude to see if Claude has anything in its training data that might give some extra insight. It could also be that it just helps with firing off 10 search queries and finds a blog post that is useful to the conversation that you've had with your local LLM.
I was looking for details about cars and it started interjecting how the safety would affect my children by name in a conversation where I never mention my children. I was asking details about Thunderbolt and modern Ryzen processors and a fresh Gemini chat brought in details about a completely unrelated project I work on. I’ve always thought local LLMs would be important, but whatever Google did in the past few weeks has made that even more clear.
Maybe in the distant future when device compute capacity has increased by multiples and efficiency improvements have made smaller LLMs better.
The current data center buildouts are using GPU clusters and hybrid compute servers that are so much more powerful than anything you can run at home that they’re not in the same league. Even among the open models that you can run at home if you’re willing to spend $40K on hardware, the prefill and token generation speeds are so slow compared to SOTA served models that you really have to be dedicated to avoiding the cloud to run these.
We won’t be in a data center crunch forever. I would not be surprised if we have a period of data center oversupply after this rush to build out capacity.
However at the current rate of progress I don’t see local compute catching up to hosted models in quality and usability (speed) before data center capacity catches up to demand. This is coming from someone who spends more than is reasonable on local compute hardware.
But a local model + good harness with a robust toolset will work for people more often than not.
The model itself doesn't need to know who was the president of Zambia in 1968, because it has a tool it can use to check it from Wikipedia.
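A minimal sketch of that kind of lookup tool, using Wikipedia's REST summary endpoint (the function names and User-Agent string are illustrative):

```python
import json
import urllib.parse
import urllib.request

def summary_from_payload(payload):
    """Keep only the short lead extract; the model doesn't need the rest."""
    return payload.get("extract", "")

def wikipedia_lookup(topic):
    # The REST summary endpoint returns JSON whose "extract" field
    # holds a few sentences of lead text for the article
    url = ("https://en.wikipedia.org/api/rest_v1/page/summary/"
           + urllib.parse.quote(topic.replace(" ", "_")))
    req = urllib.request.Request(
        url, headers={"User-Agent": "local-llm-tool/0.1"})
    with urllib.request.urlopen(req) as resp:
        return summary_from_payload(json.load(resp))
```

The local model only has to phrase the query and summarize the extract, which is well within reach of small models.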
They've usually been intended for ereader/off-grid/post-zombie-apocalypse situations but I'd guess someone is working on an llm friendly way to install it already.
It'd be interesting to know the tradeoffs. The Tiananmen Square example suggests why you'd maybe want the knowledge facts to come from a separate source.
ChatGPT free falls back to GPT-5.2 Mini after a few interactions.
This is all on top of the (to me) insufferable tone of the non-thinking models, but that might well be how most users prefer to be talked to, and whether that's how these models should accordingly talk is a much more nuanced question.
Regardless of that, everybody deserves correct answers, even users on the free tier. If this makes the free tier uneconomical to serve for hours on end per user per day, then I'd much rather they limit the number of turns than dial down the quality like that.
Have you tried the free version of ChatGPT? It is positively appalling. It’s like GPT 3.5 but prompted to write three times as much as necessary to seem useful. I wonder how many people have embarrassed themselves, lost their jobs, and been critically misinformed. All easy with state-of-the-art models but seemingly a guarantee with the bottom sub-slop tier.
Is the average person just talking to it about their day or something?
"when hostapd initializes 80211 iface over nl80211, what attributes correspond to selected standard version like ax or be?"
It works fine and avoids falling into the trap set by the misleading question. It probably works even better for more popular technologies. Yeah, it has higher failure rates, but that's not a dealbreaker for non-autonomous use cases. You can try asking it the same question as Claude and compare the answers. I can guarantee you that the ChatGPT answer won't fit on a single screen of a 32" 4K monitor.
Claude's will.
Most users are fixing grammar/spelling, summarising/converting/rewriting text, creating funny icons, and looking up simple facts. All of that is far from needing frontier model performance.
I've a feeling that if/when Apple release their onboard LLM/Siri improvements that can call out if needed, the vast majority of people will be happy with what they get for free that's running on their phone.
I think the "need" you speak of is a bit of a colored statement.
For example, last week I built a real-time voice AI running locally on iPhone 15.
One use case is for people learning to speak English. The STT is quite good and the small LLM is enough for basic conversation.
I first tried Qwen 3.5 0.8B Q4_K_S and the model couldn't hold a basic conversation, although I haven't tried lower quants of the 2B.
I'm also interested in the Apple Foundation models, and it's something I plan to try next. AFAIK it's on par with Qwen-3-4B [0]. The biggest upside, as you alluded to, is that you don't need to download it, which is huge for user onboarding.
[0] https://machinelearning.apple.com/research/apple-foundation-...
People are comparing the cost per inference or per token and saying data centers are more efficient, which makes obvious sense. What I'm saying is that if we eliminate the need for building out dozens of gigawatt data centers entirely, then we would use less electricity. I feel like this makes intuitive sense. People are getting lost in the details about cost per inference and performance on different models.
could also be considered a triage layer
If you have 100 mbit/sec internet connection at home, a computer in a data centre has 10 gbit/sec, but the server is serving 200 concurrent clients — your bandwidth is twice as fast.
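The arithmetic behind that comparison, spelled out (assuming the server's uplink is fair-shared across clients):

```python
home_mbps = 100
server_gbps = 10
concurrent_clients = 200

# Fair-share slice of the server's uplink, per client
per_client_mbps = server_gbps * 1000 / concurrent_clients

print(per_client_mbps)              # 50.0 Mbit/s per client
print(home_mbps / per_client_mbps)  # 2.0: the home link is twice the share
```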
When VCs inevitably ask their AI labs to start making money or shut down, those free open source LLMs will cease to be free.
Chinese AI labs have to release free open source models because they distill from OpenAI and Anthropic. They will always be behind. Therefore, they can't charge the same prices as OpenAI and Anthropic. Free open source is how they can get attention and how they can stay fairly close to OpenAI and Anthropic. They have to distill because they're banned from Nvidia chips and TSMC.
Before people tell me Chinese AI labs do use Nvidia chips, there is a huge difference between using older gimped Nvidia H100 (called H20) chips or sneaking around Southeast Asia for Blackwell chips and officially being allowed to buy millions of Nvidia's latest chips to build massive gigawatt data centers.
They don't really have to, though; they just need to be good enough and cheaper (even if distilled). That being said, it is true they are gaining a lot of visibility (especially Qwen) because of being open-source (open-weight, really).
Hardware-wise, it seems they will catch up in 3-5 years (Nvidia is kind of irrelevant; what matters is the node).
Chips take about 3 years to design. Do you think China will have Feynman-level AI systems in 3 years?
I think in 3 years, they'll have H200-equivalent at home.
Car manufacturers said the same.
I could see the model becoming part of the OS.
Of course Google and Microsoft will still want you to use their models so that they can continue to spy on you.
Apple, AMD and Nvidia would sell hardware to run their own largest models.
How would it use less electricity? I’d like to learn more.
Service providers that do batch>1 inference are a lot more efficient per watt.
Local inference can only do batch=1 inference, which is very inefficient.
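A back-of-the-envelope model of why that is (ignoring activations and KV cache): token generation is memory-bandwidth-bound, and the weights have to be streamed from memory once per step regardless of how many sequences share that step. So FLOPs per weight byte, roughly what a memory-bound decoder "gets paid" per byte moved, scales linearly with batch size:

```python
def decode_flops_per_weight_byte(batch_size, d_model, bytes_per_param=2):
    """Arithmetic intensity of one linear layer during decoding.

    One token per sequence costs 2 * batch * d^2 FLOPs, but the d^2
    weight matrix is streamed from memory only once per step,
    however large the batch is.
    """
    flops = 2 * batch_size * d_model * d_model
    weight_bytes = d_model * d_model * bytes_per_param
    return flops / weight_bytes
```

By this toy measure, a server running batch 64 extracts 64x more useful compute per byte of weights moved than a local batch=1 setup.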
SSD weights offload makes it feasible to run SOTA local models on consumer or prosumer/enthusiast-class platforms, though with very low throughput (the SSD offload bandwidth is a huge bottleneck, mitigated by having a lot of RAM for caching). But if you only need SOTA performance rarely and can wait for the answer, it becomes a great option.
As it stands today, local LLMs don't work remotely as well as some people try to picture them, in almost every way -- speed, performance, cost, usability etc. The only upside is privacy.
I run 32b agents locally on a big video card, and smaller ones on CPU, but what's lacking there isn't the logic or reasoning; it's the chain of tooling that Claude Code and other stacks have built in.
Doing a lot of testing recently with my own harness, you would not believe the quality improvement you can get from a smaller LLM with really good opening context.
Even Microsoft is working on 1-bit LLMs...it sucks right now, but what about in 5 years?
But the OP is correct -- everything will have an LLM on it eventually, much sooner than people who do not understand what is going on right now would ever believe is possible.
Your idea of what people need from Local LLMs and others are different. Not everybody needs a /r/myboyfriendisai level performance.
Agree, and I think of it this way: for a lot of businesses, it already makes sense to have a bunch of more powerful computers and run them centralized in a datacenter. Nevertheless, most people at most companies do most of their work on their Macbook Air or Dell whatever. I think LLMs will follow a similar pattern: local for 90% of use cases, powerful models (either on-site in a datacenter or via a service) for everything else.
Sorry to shatter your bubble, but this is patently false, LLMs are far more efficient on hardware that simultaneously serves many requests at once.
There's also the (environmental and monetary) cost of producing overpowered devices that sit idle when you're not using them, in contrast to a cloud GPU, which can be rented out to whoever needs it at a given moment, potentially at a lower cost during periods of lower demand.
Many LLM workloads aren't even that latency sensitive, so it's far easier to move them closer to renewable energy than to move that energy closer to you.
The LLM inference itself may be more efficient (though this may be impacted by different throughput vs. latency tradeoffs; local inference makes it easier to run with higher latency) but making the hardware is not. The cost for datacenter-class hardware is orders of magnitude higher, and repurposing existing hardware is a real gain in efficiency.
If you're purely repurposing hardware that you need anyway for other uses, that doesn't really matter.
(Besides, for that matter, your utilization might actually rise if you're making do with potato-class hardware that can only achieve low throughput and high latency. You'd be running inference in the background, basically at all times.)
You might want to read this: https://arxiv.org/abs/2502.05317v2
Data centers use GPU batching, much higher utilisation rates, and more efficient hardware. It's borderline two orders of magnitude more efficient than your desktop.
A lot of stuff that we ask of these models isn't all that hard. Summarize this, parse that, call this tool, look that up, etc. 99.999% really isn't about implementing complex algorithms, solving important math problems, working your way through a benchmark of leet programming exercises, etc. You also really don't need these models to know everything. It's nice if it can hallucinate a decent answer to most questions. But the smarter way is to look up the right answer and then summarize it. Good enough goes a long way. Speed and latency are becoming a key selling point. You need enough capability locally to know when to escalate to something slower and more costly.
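That escalation logic can be sketched as a tiny router: try the cheap local model first, and fall through to the slower, costlier one only when the local model signals low confidence. The model interfaces and the threshold here are hypothetical stand-ins:

```python
def route(task, local_model, cloud_model, confidence_threshold=0.7):
    """Answer locally when confident; otherwise escalate to the cloud.

    local_model(task) -> (answer, confidence in [0, 1])
    cloud_model(task) -> answer
    """
    answer, confidence = local_model(task)
    if confidence >= confidence_threshold:
        return answer, "local"
    # Below threshold: pay the latency/cost for the bigger model
    return cloud_model(task), "cloud"
```

The hard part, as the comment notes, is getting the local model's confidence signal to be honest enough that escalation triggers at the right moments.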
This will drive an overdue increase in the memory size of phones and laptops. Laptops especially have been stuck at the same common base level of 8-16GB for about 15 years now. Apple still sells laptops with just 8GB (their new Neo). I had a 16GB MacBook Pro in 2012. At the time that wasn't even that special. My current one has 48GB, enough for some of the nicer models. You can get as much as 256GB today.
DRAM costs are still skyrocketing, so no, I don't think so. It's more likely that we'll bring back wear-resistant persistent memory as formerly seen with Intel Optane.
Just plug a stick into your USB-C port, or add an M.2 or PCIe board, and you'll get dramatically faster AI inference.
Put another way, there already exist add-in boards like this, and they’re called GPUs.
An "LLM chip" does not need that and so can be much more efficient.
Who will pay for the ongoing development of (near-)SoTA local models? The good open-weight models are all developed by for-profit companies - you know how that story will end.
I think Apple had something in the region of 143 billion in revenue in the last quarter.
Not saying it will happen - just that there are a variety of business models out there and in the end it all depends on where consumers put their money.
I'm not convinced that local LLMs use less electricity either. Per token at the same level of intelligence, cloud LLMs should run circles around local LLMs in efficiency. If it doesn't, what are we paying hundreds of billions of dollars for?
I think local LLMs will continue to grow and there will be an "ChatGPT" moment for it when good enough models meet good enough hardware. We're not there yet though.
Note, this is why I'm big on investing in chip manufacture companies. Not only are they completely maxed out due to cloud LLMs, but soon, they will be double maxed out having to replace local computer chips with ones that are suited for inferencing AI. This is a massive transition and will fuel another chip manufacturing boom.
It's just wishful thinking (and hatred towards American megacorps). Old as the hills. Understandable, but not based on reality.
The WebGPU model in my browser on my M4 Pro MacBook was as good as ChatGPT 3.5 and doing 80+ tokens/s.
Local is here.
If it has something like 80GB of VRAM, it'll cost $10k.
The actual local LLM chip is Apple Silicon starting at the M5 generation with matmul acceleration in the GPU. You can run a good model using an M5 Max 128GB system. Good prompt processing and token generation speeds. Good enough for many things. Apple accidentally stumbled upon a huge advantage in local LLMs through unified memory architecture.
Still not for the masses and not cheap and not great though. Going to be years to slowly enable local LLMs on general mass local computers.
CC: Claude Code
TC: total comp(ensation)
On device I would gladly pay for good hardware - it's my machine and I'm using as I see fit like an IDE.
In fact the space seems to move at a rapid pace as more and more specialized models come out. There's a possible trajectory where open weight models will compete side by side or even be preferable for many use cases, just like what happened with OS's and SQL DB's.
Code tools that free up my time are very nice.