"They" fully well know that they current frontier model are maybe 6 month ahead of what people will have access to without their control. See Deepseek as Exibit B
The reason you can't run these locally has more to do with the fact that these mythos-sized models require extreme amounts of memory and processing power to run at acceptable speeds, and neither you nor I can afford the resources to run them locally. A big part of it is that "running locally" means running on your own hardware, and for almost everyone that means hardware that will spend a big portion of its time just sleeping. Because data centers and providers have higher utilization rates, they can easily outpace you. That, and the fact that when they place an order it's usually for hundreds of thousands of units.
This piqued my interest in how it does it, and after briefly checking the project it seems it only has two features for automatic photo categorization: 1) it can group photos by date, and 2) it has face detection and recognition that uses trained weights (so ML "intelligence").
I got away from Google Photos and now upload to my own Immich instance.
I also use an open-source camera app from F-Droid to de-Google that whole path.
There's also the fact that an M5 version is coming, and they likely know they're going to sell out on day one (I expect a price correction from Apple for higher-end configs of M5 Studios; the base price will probably stay the same), so they need to build up stock reserves.
qwen3.5-2b and qwen3.5-4b are great at document parsing. They can run on CPU
qwen3.6-27b and gemma4-31b are borderline better than the human eye in some cases. Their OCR isn't perfect, but they're seriously good. They can still run on the CPU but you'll be waiting minutes per document.
You can demand JSON, YAML, MD, or freeform text just by varying the prompt. Even if you have a custom template, you can just put that in the prompt and they'll do an OK-ish job.
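For example, here's a minimal sketch (assuming a local llama.cpp llama-server instance on its default port 8080, with whatever model you loaded) of getting JSON back just by changing the prompt; the field names are made up for illustration:

# Minimal sketch: ask a local llama-server for structured JSON purely via the prompt.
# The schema/field names below are placeholders -- adjust to your documents.
import requests

prompt = (
    "Extract the invoice number, date, and total from the document below. "
    "Respond with a single JSON object with keys invoice_number, date, total, "
    "and nothing else.\n\n" + open("invoice.txt").read()
)

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,
    },
    timeout=600,
)
print(resp.json()["choices"][0]["message"]["content"])

Swap the instruction for "respond in YAML" or paste in your own template and the same call keeps working.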
There are also models that aren't in the r/locallama zeitgeist. IBM released a new 4B-parameter model for structured text extraction last week, and there's a sea of recent Chinese OCR models too.
IMO the open-weights models are so good that in a lot of cases it's not worth paying frontier labs for OCR. The only barrier to entry is the effort to set up a pipeline, and having the spare CPU/GPU capacity.
Besides those, there are a few smaller open-weights models dedicated to OCR tasks, for instance DeepSeek-OCR-2 and IBM granite-vision-4.1-4b. (They can be found on huggingface.co.)
These dedicated vision models can run on much cheaper hardware, including smartphones, than the big models that handle images in addition to text.
Similarly, besides the bigger multimodal models that accept audio, images, or text as input, there are smaller open-weights models dedicated to speech recognition, e.g. Xiaomi MiMo-V2.5-ASR and IBM granite-speech-4.1-2b.
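As a rough illustration of how little code a local ASR pipeline needs, here's a sketch using the Hugging Face transformers ASR pipeline; I'm using a Whisper checkpoint as a stand-in, since I haven't checked whether the models above plug into this exact API:

# Minimal local speech-to-text sketch using the transformers ASR pipeline.
# The model ID below is a stand-in; substitute whichever dedicated
# open-weights ASR model you pulled from huggingface.co.
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")
result = asr("meeting_recording.wav")  # any local audio file
print(result["text"])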
Isn't that a function of RAM supply not being available now?
Even if that weren't the case, every corp _needs_ you to be on a subscription.
That's an interesting way to view the world. I mean, utterly stupid as it is, but interesting.
But the previous sentence is even stupider (a Perl script 10 years ago could write code like Qwen does now?), so I guess at least it's consistent.
Who runs an IDE with LLM agents accessing their local filesystem on bare metal?
Or am I the only one who runs everything LLM-related in a VM, just for development work? Then, because of Zed's genius decision, you need to share your GPU with the VM, and then some important features stop working, like snapshots. So you need a workaround for that too, etc.
Too much hassle, Zed is not for me.
But I'm anti-Apple, so maybe that's the reason :)
Btw, even "ImHex" devs realized this and they're providing version without acceleration for VM use. They're using ImGui. Using it for local desktop app UI is also ridiculous, imho. Whatever.
Doesn’t ghostty also use graphics acceleration? I was under the impression that rendering text is a relatively challenging graphics compute task.
Maybe the future is a selection of local models trained for specific stacks?
I mean, I've been forcing my good old 1080 Ti to run local models since shortly after llama was first leaked.
But I wouldn't say "local models are here" in the same way as "year of the Linux desktop!111"
Until someone can just go out and buy some sort of "AI pod" that they can take home, plug in, and hit one button on a mobile app to select a model (or even just pick from models hidden behind various personas), I wouldn't say it's quite there yet.
It's important that the average consumer can do it. I think the limitations there are: things are changing too quickly; RAM and compute components are exceedingly expensive right now; and we're still waiting on better controls/harnesses for this stuff to stop consumers not just from shooting themselves in the foot, but from blowing their foot clean off.
It would be interesting to see a Taalas-like chip in a product, although there's so much changing at the moment with diffusion-based models and Google's Turboquant (which, as someone who has almost always had to run quantized models, makes a lot of sense to me).
I’m interested in self-hosting for privacy and control. I already owned the hardware I’m testing with, so my spend is limited to time and electricity.
The “LLM pods” you describe will be loaded with spyware and adware (see: Smart TVs), and average consumers won’t max their compute around the clock so naturally data centers are able to make more efficient use of hardware by maximizing utilization.
* Have a box with sufficient spare (V)RAM -- probably 8G for simple categorization with qwen3.5-4b, and 24G or more for more intelligent categorization with qwen3.6-27b or gemma4-31b.
* Download or compile llama.cpp. Choose a model, then choose one of the "quantized" builds that will actually fit on your hardware. There are literally hundreds to thousands of these per model on Hugging Face.
* Spend half a day tuning command-line parameters until llama.cpp doesn't crash.
* Watch llama.cpp regularly OOM itself, then put it in a systemd service with a memory limit so it doesn't take the entire machine down when it dies.
* Download all your photos to a folder.
* Start vibing a Python script to categorize your images by repeatedly prompting the LLM with each image in turn (a rough sketch follows this list).
* Spend days tweaking/refining the prompt to try to get the LLM to actually do what you want.
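For the curious, here's a minimal sketch of that loop, assuming llama-server is running a vision-capable model (launched with its matching --mmproj file) and serving the OpenAI-compatible API on localhost:8080; the category names are placeholders:

# Minimal sketch of the categorization loop against a local llama-server
# running a vision-capable model. Categories and paths are placeholders.
import base64, json, pathlib, requests

CATEGORIES = ["people", "pets", "food", "documents", "landscapes", "other"]

def categorize(path):
    b64 = base64.b64encode(path.read_bytes()).decode()
    resp = requests.post(
        "http://localhost:8080/v1/chat/completions",
        json={
            "messages": [{
                "role": "user",
                "content": [
                    {"type": "text",
                     "text": "Classify this photo as exactly one of: "
                             + ", ".join(CATEGORIES)
                             + ". Reply with the category name only."},
                    {"type": "image_url",
                     "image_url": {"url": "data:image/jpeg;base64," + b64}},
                ],
            }],
            "temperature": 0,
        },
        timeout=600,
    )
    return resp.json()["choices"][0]["message"]["content"].strip().lower()

results = {str(p): categorize(p) for p in pathlib.Path("photos").glob("*.jpg")}
print(json.dumps(results, indent=2))

The plumbing is the easy part; as the next steps suggest, most of the time goes into the prompt.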
The endgame is one of:
* The local model categorizes your images. Yay.
* The local model is too slow and you give up. Boo.
* The local model is too slow, so you spend $1k-$10k on hardware. Your image categorization task becomes a cover story for buying new gear. Yay.
* The local model can't understand your categorization metric, so you give up. Boo.
* You eagerly await news of the next open model being released. Yay?
* You consider replacing your local model with a frontier model, but then you realize you'd be spending $500 to categorize your photos. Boo.
* You refuse to allow Google/Gemini/Anthropic to train on your nudes. Boo.
The USB drive light is flickering, showing something is happening. It's been about 8 hours since I entered the prompt and I've gotten about 10 tokens back so far. I'm going to leave it running overnight and see what happens.
What did you use to do this? Something standard like llama.cpp, or something else like vLLM, or your own contraption?
I mean, the inference engine might need some tweaks to support whatever compute is available. But then, if you add a few terabytes of disk for swap, and swap the RAM for bigger sticks if possible, it should work? Slowly, of course, but there's no reason it shouldn't.
Reciprocal?
I use an Anaconda environment on Linux (though I would have preferred a "uv" environment) and automate the startup sequence using the following script (start_comfy.sh) from the terminal, rather than manually starting the environment from that same terminal:
#!/bin/bash
#
# temporary shell version
# make conda usable from a non-interactive shell
eval "$(conda shell.bash hook)"
conda activate comfy-env
# launch ComfyUI with reduced VRAM use and the VAE on the CPU
comfy launch -- --lowvram --cpu-vae
Here are some of the images: https://imgbox.com/nqjYhdx3 https://imgbox.com/93vSWFic https://imgbox.com/qs1898dz
I'm hesitant to increase the sizes of the renders as that will surely stress my laptop's components.
I tried oMLX and OpenCode a few weeks ago and the 65k context window was useless; it tried to analyze a very small codebase before going full-on agentic and ran out of context immediately.
I don't have time to tweak 1,000 permutations of settings just to re-prove that it's not as smart as Opus 4.6.
I need out-of-the-box multimodal behavior that's as simple as typing claude at the command line, and it's just not there yet.
But I'm open to seeing what people's workflows are.
It's usable. I set it loose on the postgres codebase, told it to find or build a performance benchmark for the bloom filter index and then identify a performance improvement. It took a long time (overnight), but eventually presented an alternate hashing algorithm with experimental data on false positive rate, insertion speed and lookup speed. There wasn't a clear winner, but it was a reasonable find with rigorous data.
I gave it the reference C implementation, the LTFS spec from SNIA, and asked it to use the C implementation to verify the correctness of the Go code.
LTFS is a pretty straightforward spec, so it made a very reasonable port within about 2 days. It's now working on implementing the iSCSI initiator (client) to speak with my tape drive directly, without involving the kernel.
Edit: the model is Qwen3.6-35B