upvote
Look at the current crop of local Mixture of Experts models; it seems like they've made inroads on the O(n^2) context-attention cost problem. Several folks have mentioned Qwen, but there are many more of that ilk, and several of them score really high on benchmarks. But when I try one of them locally myself (I have a 3090), it feels a bit like last year's Sonnet. They don't quite make the leaps of understanding you get from Opus.

* Weird thing of the day: https://huggingface.co/Jackrong/Qwen3.5-27B-Claude-4.6-Opus-...

reply
Well, I reinstalled LM Studio today, some ~10 months after I last used it, just to test Gemma 4. On my PC with 32GB RAM and a 4070 Ti (12GB VRAM), it (Gemma 4 26B A4B Q4_K_M) loads and runs reasonably fast with no manual parameter or configuration tuning - just out of the box, on a fresh install - and delivers usable results, on the level I remember expecting from SOTA cloud models 12-16 months ago. It handles image input, too. I'm quite impressed with it, TBH. It's something I can finally see myself using, and yay, it even leaves some RAM and VRAM free for doing other stuff.
reply
You can run SOTA local MoE models very slowly by streaming the weights in from a fast PCIe 5 SSD. Kimi 2.5 (generally considered in the ballpark of current Sonnet, not Opus of course) has been measured at 2 tok/s on Apple M5 hardware, which is about the best-case performance unless you have niche HEDT hardware with enough PCIe lanes to attach lots of storage and can figure out how to exploit that much parallel transfer throughput.
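A back-of-envelope sketch of that throughput ceiling (all numbers below are illustrative assumptions, not measured specs for Kimi or any particular SSD): with an MoE model whose weights stream from storage, each generated token has to read roughly the active experts' weights once, so decode speed is bounded by SSD bandwidth divided by active-parameter bytes:

```python
def max_tokens_per_sec(active_params_b: float, bytes_per_param: float, ssd_gb_per_s: float) -> float:
    """Rough upper bound on decode speed when weights stream from storage.

    active_params_b: active parameters per token, in billions (MoE: only the
                     routed experts count, not total parameters)
    bytes_per_param: e.g. 0.5 for a 4-bit quantization
    ssd_gb_per_s:    sustained sequential read bandwidth of the drive
    """
    bytes_per_token = active_params_b * 1e9 * bytes_per_param
    return ssd_gb_per_s * 1e9 / bytes_per_token

# Hypothetical example: ~32B active params at 4-bit, one PCIe 5 SSD at ~12 GB/s.
print(max_tokens_per_sec(32, 0.5, 12))  # → 0.75
```

Landing in the single-digit tok/s range for any big MoE model, which is why adding drives (more parallel bandwidth) or keeping hot experts cached in RAM is the only way to push past it.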
reply
A ~$5000 USD MacBook can run open source models that are competitive with GPT-3.5 or Sonnet 3. So on nice consumer hardware you can have the original groundbreaking ChatGPT experience, running locally.
reply
Qwen 3.5, Gemma 4
reply
[flagged]
reply