It's here, right now. I'm running quantized Qwen and Gemma on a decent, but three years old gaming rig (think RTX 3080 12GB and 32 GB RAM). Yes, it's slow, it has a small context window. But it can (given a proper harness) run through my trip photos and categorize them. It can OCR receipts and summarize spendings. It can answer simple questions, analyze code and even write code when little context is required. Probably I could get a half-decent autocomplete out of it, if I bother with VS Code integration. "128 GB VRAM on a MacBook Pro or a Strix Halo" is already a minimum viable setup for agentic coding, I think.
> And then we'll have the equilibrium we already have with the "classic cloud": you either self-host or pay for flexibility and speed.
Currently, it works exactly the other way. The cloud versions are orders of magnitude cheaper than self hosting, because sharing can utilize servers much more efficiently. Company can spend half a million bucks on a rig running GLM 5.1, and get data security, flexibility and lack of censorship, but oh it's so expensive compared to Anthropic per-seat plans.
This piqued my interest on how it does it and after briefly checking the project it seems it only has two features for automatic photo categorization. 1) it can group photos by date and 2) It has face detection and recognition that uses trained weights (so ML "intelligence").
Also the fact that an M5 version will be coming, and they likely know they are going to sell out on day one (I expect we'll see a price correction from Apple for higher end configs of M5 studios, base price will probably stay the same), so they need to build up stock reserves.
qwen3.5-2b and qwen3.5-4b are great at document parsing. They can run on CPU
qwen3.6-27b and gemma4-31b are borderline better than the human eye in some cases. Their OCR isn't perfect, but they're seriously good. They can still run on the CPU but you'll be waiting minutes per document.
You can demand JSON, YAML, MD, or freeform text just by varying the prompt. Even if you have a custom template, you can just put that in the prompt and they'll do an OK-ish job.
There's also models that aren't in the r/locallama zeitgeist. IBM released a new 4b parameter model for structured text extraction last week, and there's a sea of recent chinese OCR models too.
IMO the open wights models are so good that in a lot of cases it's not worth paying frontier labs for OCR purposes. The only barrier to entry is the effort to set up a pipeline, and havin the spare CPU/GPU capacity.
Besides those, there are a few smaller open-weights models that are dedicated for OCR tasks, for instance DeepSeek-OCR-2 and IBM granite-vision-4.1-4b. (They can be found on huggingface.co)
The dedicated vision models can be run on much cheaper hardware, including smartphones, than the big models that can process images besides text.
Similarly, besides bigger multimodal models, that can accept audio, images or text as imput, there are smaller open-weights models that are dedicated for speech recognition, e.g. Xiaomi MiMo-V2.5-ASR and IBM granite-speech-4.1-2b.
Isn't that a function of RAM supply not being available now?
Even if that weren't the case, every corp _needs_ you to be on a subscription.
That's an interesting way to view the world. I mean, utterly stupid as it is, but interesting.
But the previous sentence is even stupider (a Perl script 10 years ago could write code like Qwen does now?), so I guess at least it's consistent.
Who runs IDE with LLM agents accessing your local filesystem, on bare metal?
Or am I alone to run everything LLM related on my VM just for development work. Then because of ZED genius decision, you need to share your GPU to VM, then some important features will not work, like snapshots. So you also need workaround for this, etc.
Too much hassle, Zed is not for me.
But I'm anti-Apple, so maybe that's the reason :)
Btw, even "ImHex" devs realized this and they're providing version without acceleration for VM use. They're using ImGui. Using it for local desktop app UI is also ridiculous, imho. Whatever.
Doesn’t ghostty also use graphics acceleration? I was under the impression that rendering text is a relatively challenging graphics compute task.
Maybe the future is a selection of local, specific stack trained models?
* Have a box with sufficient spare (V)RAM -- probably 8G for simple categorization with qwen3.5-4b, and 24G or more for more intelligent categorization with qwen3.6-27b or gemma4-31b.
* Download or compile llama.cpp. Choose a model, then choose one of the "quantized" builds that will actually fit on your hardware. There are literally hundreds to thousands of these per model on Hugging Face.
* Spend half a day tuning command-line parameters until llama.cpp doesn't crash.
* Watch llama.cpp regularly OOM itself, then put it in a systemd service with a memory limit so it doesn't take the entire machine down when it dies.
* Download all your photos to a folder.
* Start vibing a Python script to categorize your images by repeatedly prompting the LLM with each image in turn.
* Spend days tweaking/refining the prompt to try to get the LLM to actually do what you want.
The endgame is one of:
* The local model categorizes your images. Yay.
* The local model is too slow and you give up. Boo.
* The local model is too slow, so you spend $1k-$10k on hardware. Your image categorization task becomes a cover story for buying new gear. Yay.
* The local model can't understand your categorization metric, so you give up. Boo.
* You eagerly await news of the next open model being released. Yay?
* You consider replacing your local model with a frontier model, but then you realize you'd be spending $500 to categorize your photos. Boo.
* You refuse to allow Google/Gemini/Anthropic to train on your nudes. Boo.
I mean I've been forcing my good old 1080ti to run local models since a short while after llama was first leaked.
But I wouldn't say "local models are here" in the same way as "year of the Linux desktop!111"
Until someone can just go out and buy some sort of "AI pod" that they can take home, plug in and hit one button on a mobile app to select a model (or even just hide models behind various personas) then I wouldn't say it's quite there yet.
It's important that the average consumer can do it, I think the limitations for that are: things are changing too quickly, ram+compute components are exceedingly expensive now, we're still waiting on better controls/harnesses for this stuff to stop consumers not just from shooting themselves in the foot, but blowing their foot clean off.
Would be interesting to see a Taalas-like chip in a product, albeit there's so many changes going on atm with diffusion based models, Google's Turboquant (which as someone who has had to almost always run quantized models, makes a lot of sense to me).
I’m interested in self-hosting for privacy and control. I already owned the hardware I’m testing with, so my spend is limited to time and electricity.
The “LLM pods” you describe will be loaded with spyware and adware (see: Smart TVs), and average consumers won’t max their compute around the clock so naturally data centers are able to make more efficient use of hardware by maximizing utilization.
The USB drive light is flickering, showing something is happening. It's been about 8 hours since I entered the prompt and I've gotten about 10 tokens back so far. I'm going to leave it running overnight and see what happens.
What did you use to do this, something standard like llamacpp or something else like vllm or your own contraption ?
I mean, inference engine might need to get some tweaks, to support whatever compute is available. But then, if you put a few terabytes of disk for swap, and replace RAM to bigger sticks if possible, it should work? Slowly, of course, but there is no reason it should not to.
Reciprocal?
I use an anaconda environment, though would have preferred an "uv" environment, on Linux and automate the startup sequence using the following script (start_comfy.sh) from the term rather than manually starting the environment from same said term:
#!/bin/bash
#
# temporary shell version
eval "$(conda shell.bash hook)"
conda activate comfy-env
comfy launch -- --lowvram --cpu-vae
Here are some of the images: https://imgbox.com/nqjYhdx3 https://imgbox.com/93vSWFic https://imgbox.com/qs1898dz
I'm hesitant to increase the sizes of the renders as that will surely stress my laptop's components.
I tried oMLX and OpenCode a few weeks ago and the 65k context window was useless, it tried to analyze a very small codebase before going full on agentic and ran out of context window immediately
I don't have time to tweak 1,000 permutations of settings just re-prove that its not as smart as Opus 4.6
I need out the box multimodal behavior as similar as typing claude in the command line and its so not there yet
but I'm open to seeing what people's workflows are
It's usable. I set it loose on the postgres codebase, told it to find or build a performance benchmark for the bloom filter index and then identify a performance improvement. It took a long time (overnight), but eventually presented an alternate hashing algorithm with experimental data on false positive rate, insertion speed and lookup speed. There wasn't a clear winner, but it was a reasonable find with rigorous data.
I gave it the reference C implementation, the LTFS spec from SNIA, and asked it to use the C implementation to verify the correctness of the Go code.
LTFS is a pretty straightforward spec, so it made a very reasonable port within about 2 days. It's now working on implementing the iSCSI initiator (client) to speak with my tape drive directly, without involving the kernel.
Edit: the model is Qwen3.6-35B
FWIW I think Gemma 4 31b is more likely to be of use to me than Sonnet, idfk, maybe it's a skill issue but I love Opus 4.7, undisputed king, but Sonnet seems borderline useless and I basically think of it as on the same level as Qwen 35b MoE.
But they diverge greatly on other particular ones whenever the ViT tower and the apriori knowledge of the world is crucial. I wish Gemma was on par but both me and Google know they not.
I'm going to switch to local LLMs for most stuff soon.
Thot_experiment is saying that his 2016 Toyota Prius is a great and reliable car for his daily commute and running errands.
Whereas everyone is screeching about its capability gap with a Lockheed Martin F35 lightning.
(of course if i'm being honest 640kB is fine, i'm sure tons of the world's commerce is handled by less for example, the delta between a system with 640kb of ram and a modern one is near nil for many people, the UX on a PoS terminal does not require more than that for example, the hacker news UX could also be roughly the same)
Doubtful. The increase in demand is greatly outpacing supply, and all signs point to a continued acceleration in demand
> If I could drop $10,000 to have an effectively permanent opus 4.7 subscription today, I would.
lol well obviously, but realistically that price point is going to be closer to $100k, with a perpetual $1k a month in power costs.
I predict the B200 data centers we're build today will be obsolete in 3 years and we'll be using whatever models and hardware that isn't even on a road map today. Likely not NVIDIA, likely not OpenAI or Anthropic. Maybe Chinese?
In the mean time, we must continue building software with the clumsy coding agents tied to cloud services as this (for now) seems to be about the only area where AI economically makes sense.
If we think about the near future, something like Kimi2.6 is within the realm of Opus 4.6 today, but requires closer to $700k in hardware to run.
> For those of us a bit crazy, we are running KimiK2.6, GLM5.1
Yes, those can compare to Opus, but you can't run those unquantized for less than $400k in hardware.
A single M3 maxed can run a Q2 Kimi 2.6, though thats with a hardly degraded perplexity.
2x M3s with RDMA can run a lossless Kimi2.6 at Q4, but with CPU only you would get okayish decode but horrible (+1m) TTFT, that wouldnt be a great _interactive_ experience.
If you believe what you read here, the gap is closing fast.
For niche applications, sure. For general use, I think the tendency towards the best model being used for everything will–to the model publishers' delight–continue. It's just much easier to get a feel for Opus and then do everything with it, versus switch back and forth and keep track of how Haiku came up with novel ways to dumbfuck this Sunday evening.
Fixed that for you. Right now most models produced are based on floating point maths and probabilities, which is "expensive" to do math on.
Microsoft has researched 1-bit LLMs which can run much more efficiently, and on much cheaper hardware[1].
If this research is reproducable and reusable outside their research models, this means the cost of running self-hosted LLMs will be reduced by an order of magnitude once this hits mainstream.
10 years ago I was using 16GB in my MBP and today it's 48GB. It's just a 3x increase during mostly a bonanza period.
And the Mac Studio was available with 512GB until ram got scarce and they cut the max in half recently.
There's plenty of demand for RAM right now. We'll see how this turns out.
Because late stage capitalism demands endless growth in order to pay executives and shareholders (especially those late to the train) more and more YoY.
And those requirements for growth mean that cost cutting is needed. Over the past few decades cost _have_ been cut, building things more efficiently, components becoming cheaper, larger volumes in mass manufacturing.
But we have already reached a point where there are no other places to cut than the quality of the product itself. Look to shrinkflation in food and other places - look at how "live action" versions are being made of previously animated movies, how game franchises from 2 decades ago are being brought back from the dead, the huge influx of remasters etc.
Why? Because it's cheaper to revive/reuse an existing IP than it is to create a new one + it guarantees success with the drooling consumer masses. And cheaper = more Ferraris for the multi millionaire/billionaire execs.
See how much Mario movie made? Just wait...bet you there'll be a live action version. ;)
However that's not the real battle here. The real battle is control of information to operate over.
While I might have access to a decent model - I don't have the huge integrated databases of everything that companies like Google have, and increasingly governments will accumulate.
As a citizen AI operating of these large datasets is where the concern should be.
This will depend on how much inference happens for consumer (desktop, local) vs enterprise ("cloud"), vs consumer mobile (probably also cloud).
I would assume that the proportion of "consumer, local" is small relative to enterprise and mobile.
I guess, it'll most likely be an AI processing and everything else becoming API.
In case of GPTs and Claudes of the world. They'll be just using an Indexing APIs and KB on top of their LLMs.
The question is would you choose to save $10 a day if it causes your inference to slow down 10x and waste 2 hours a day waiting on stuff.
To sell tokens profitably you'd need to be able to run inference at 150 tokens per second for less than $1,000 USD a month.
I don't think people realize how expensive it is to host decently capable models and how much their use of capable models is subsidized.
You can only squeeze so many parameters on consumer grade hardware(that's actually affordable, two 4090s is not consumer grade and neither is 128gb macbooks, this is incredibly expensive for the average person, and the models you can still run are not "good enough" they are still essentially useless).
People are betting their competency on a future where billionaires are forever generous, subsidizing inference at a 10-1 20-1 loss ratio. Guess what, that WILL end and probably soon. This idea that companies can afford to give you access to 2mm in GPUs for 5 hours a day at a rate of $200.00 a month is simply unsustainable.
Right now they are trying to get you hooked, DON'T FALL FOR IT. Study, work hard, sweat and you'll reap the benefits. The guy making handmade watches, one a month in Switzerland makes a whole lot more than the guy running a manufacturing line make 50k in China. Just write your own fkin code people.
Don't bet your future on having access to some billionaire's thinking machine. Intelligence, knowledge and competency isn't fungible, the llm hype is a lie to convince you that it is.
With the new DeepSeek V4 series and its uniquely memory-light KV cache you can even extend this to parallel inference in order to hide memory bandwidth bottlenecks and increase compute intensity.
This is perhaps not so useful on a 128GB or 96GB RAM Apple Silicon device (I've seen recent reports of DS4 runs with even one agent flow hitting serious thermal and power limits on these devices, so increasing compute intensity will probably not be helpful there) but it will become useful with 64GB devices or lower that have to stream from a slow disk, or with things like the DGX Spark or to a lesser extent Strix Halo, that greatly overprovision compute while being bottlenecked on memory bandwidth.
It’s currently unsupported on Llama.cpp and vllm doesn’t support GPU+CPU MoE, so unless all of you have an array of DGX Sparks in your bedroom, what’s the secret sauce?!
i don't comprehend why people are in such disbelief at how much better this stuff runs on a mac studio than on NVIDIA hardware with 1/5th the VRAM. look, what can i say? NVIDIA is a bigger rip off than Apple is!
You are going off vibes alone, this is easily verified, please go verify.
What makes you think they have zero reason to subsidize, because the providers aren't a household names you assume they wouldn't operate at a loss? Whats your logic here? You make no sense.
Also, a lot of money is being made on input tokens and cached tokens, which are much cheaper to compute.
DeepSeek published their math for serving the V3/R1 models. They were 535% profitable: https://github.com/deepseek-ai/open-infra-index/blob/main/20...
If Anthropic and OpenAI are subsidizing the metered API usage, their model is going to end up just as successful as MoviePass. They are burning enough money on the training costs already.
If you have a machine running at 150 tok/ps you can only make $5820 a month at $15 per 1mm running 24/7. It costs a hell of a lot more than 6k a month to run Claude 4.7 @ 150 tok/ps on that machine 24/7.
This math is a bit off, because you have input tokens too, but regardless its still not profitable especially for how long it takes to turn around a request and the caching is probably not all that profitable.
Serving models on dedicated hardware is not the same as your at home 150t/s thing. Inference is measured in thousands of tokens / s in aggregate (i.e. for all the sessions in parallel). That's how they make money.
If you have a machine running at 150 tok/ps you can only make $5820 a month at $15 per 1mm running 24/7. It costs a hell of a lot more than 6k a month to run Claude 4.7 @ 150 tok/ps on that machine 24/7.
This math is a bit off, because you have input tokens too, but regardless its still not profitable especially for how long it takes to turn around a request and the caching is probably not all that profitable.
The reason it works: each time you read the model (memory bound) to calculate the next token, you can also update multiple requests (compute bound) while at it. It's also much more energy-efficient per token.
The idea that everyone is spinning up a $2 million in GPUs to scan their email inbox, search the web or avoid learning something is still ridiculous to me regardless.
Not if you're OK with 4-bit quantization. More like $30K-$50K one time.
Spring for 8 RTX6000s instead of 4, and you can use the full-precision K2.6 weights ( https://github.com/local-inference-lab/rtx6kpro/blob/master/... ).
I don't think cloud models are going away; the hardware for good perf is expensive and higher param count models will remain smarter for a looong time. Even if the hardware cost for kind-of-usable perf fell to only $10k, cloud ones will be way faster and you'd need a lot of tokens to break even.
I think local AI will win in its niche by repurposing users' existing hardware, especially as cloud hardware itself gets increasingly bottlenecked in all sorts of ways and the price of cloud tokens rises. You don't have to care about "bad" performance when you've got dedicated hardware that runs your workloads 24/7. Time-critical work that also requires the latest and greatest model can stay on the cloud, but a vast amount of AI work just isn't that critical.
There will not ever be a monthly subscription for LLM tokens. The economics isn't there.
Local tokens will always be cheaper.
Well your thinking is completely vibes based and not cemented in any reality I exist in.
They're not smarter, they just know more stuff.
You probably don't need knowledge about Pokemon or the Diamond Sutra in your enterprise coding LLM.
The "smarts" comes from post-training, especially around tool use.
> Just write your own fkin code people
Bro is nostalgic for googling random stack overflow threads for 10 days to figure out a bug the agent fixes in an hour.
The cost of cloud compute actually hasn't gone down for old hardware all that much, it still costs $500.00 a year rent 4 core i7700k that's 10 years old. Don't expect much more valuable hardware, like modern GPUs to deflate in price all that quickly.
There's 3 fabs in the world that make ddr7 and they aren't going to be selling their stock to consumers going forward, it will be purchased by datacenters almost entirely and stay in them until EOL.
Your brain is going to atrophy (this is proven), they'll raise the price to something thats closer to break even and you'll be forced to pay it because you no longer have those muscles.
I think that is a very narrow perspective. Enormous numbers of consumers own $50,000 cars, but a pair of $2000 GPUs is "not consumer"?
I agree with your view that cheap tokens on SOTA are a trap-- people should use local AI or no AI.
$50k is a median priced car in the US. I'd guess >99.9% of people do not own $4000 of GPUs. I consider myself a computer person and I dont think I even own $4000 of computer hardware in total
A top-spec MacBook Pro is >$4k, so I assure you that plenty of computer people do own $4k of computer hardware.
Hell, most tech folks are wandering around with a ~$1k smartphone in their pocket too.
A car is super useful, so is an AI. But even if we decide cars are incomparably more useful a great many people pay much more than $4000 over the minimum viable car, and that's money that could be deployed to secure access to private, secure, and autonomous AI facilities. A few thousand dollars in computing is consumer hardware, or at least could easily be with more reason and awareness driving adoption.
People spend a LOT of money in things less useful than local copy of qwen3.6-27b can be.
A friend an I had previously worked on an entropy extraction scheme and he recently got around to making a writeup about our work: https://wuille.net/posts/binomial-randomness-extractors/
I instructed the agent to read the URL, implement the technique in C++ for 32-bit registers, then make a SIMD version that interleaves several extractors in parallel for better performance. It implemented it (not hard since there was an implementation there that it read), then wrote more extensive tests. Then it vectorized it. It got confused a few times during debugging because the algorithm uses some number theory tricks so that overflows of intermediate products don't matter and it was obviously trained a lot on ordinary code were such overflows are usually fatal. I instructed it to comment the code explaining why the overflows are fine and had it continue which mostly solved its confusion.
It successfully got the initial 12MB/s scalar implementation to about 48MB/s. Then I told it to keep optimizing until it reaches 100MB/s. I came back the next day and it had stopped after 6 hours when it achieved just over 100MB/s. Reading what it did: it went off looking at disassembly, figured out what hardware it was running on, and reading microarch timing tables online and made some better decisions, tried a lot of things that didn't work, etc. (And of course, the implementation is correct).
I'm pretty skeptical about AI and borderline hateful of many people who (ab)use it and are deluded by it-- but I think this experience shows that a small local model can be objectively useful.
(oh and this experience was also while I only had the model running at 19tok/s)
Running the model in a loop where it can get feedback from actually testing stuff allows you to make progress in spite of making many mistakes.
I could have done this work myself but I didn't have to and I certainly spent less time checking in and prodding it than it would have taken me to do it. In my case I wondered how much faster parallel extractors using SIMD might be-- an idle curiosity that would have gone unanswered if not for the AI.
Congrats, but you're in the 0.0001% thats not just frying their brains, fapping to their local models or doing various magic tricks like a toddler entertained by playing with velcro.
At the end of the day you lost an opportunity to improve yourself and excercise your brain, maybe the opportunity cost is worth it idk, but Im going to keep taking things slow.
Handmade swiss watches > mass manufactured immitations. Handmade clothes > walmart clothes.
There are plenty of other uses that people have been making for a long time-- e.g. I know someone who uses a fine tuned local model to sort their incoming email and scan their outgoing messages for accidental privacy leaks.
I don't agree with your assessment on an opportunity lost-- I got my reps in on the original work, the AI gave an incremental step forward which made the whole exercise somewhat more valuable to me with minimal additional cost. I think this improves the cost vs benefit in a way that makes me more likely to try other pointless activities, knowing that when I run out of gas I can toss it to AI to try some variations.
Sometimes you're also 27 steps deep on a nested subproblem and you're really just trying to solve sometime. Even in finr craftsmanship not every step needs to be about maximum craftsmanship. :) Sometimes it's just good to get something done.
I think this is much like any other tool. One can carve furniture using only hand tools, but the benefits of a router are hard to dispute. Both approaches exist in the world and sometimes both are used in concert.
As far as people frying their brains with AI -- you don't need local models for that, plenty of people are driving themselves into deep personally and socially destructive delusion just using the chat interfaces.
I agree with you, there's a way to use them responsibly like your router anology, I just think most aren't doing this correctly and its a slippery slope. I'll contend that you probably have used them responsibly in your example.