LMArena actually has a nice Pareto frontier of Elo vs. price for this

  model                        elo   $/M
  ---------------------------------------
  glm-5.1                      1538  2.60
  glm-4.7                      1440  1.41
  minimax-m2.7                 1422  0.97
  minimax-m2.1-preview         1392  0.78
  minimax-m2.5                 1386  0.77
  deepseek-v3.2-thinking       1369  0.38
  mimo-v2-flash (non-thinking) 1337  0.24
https://arena.ai/leaderboard/code?viewBy=plot&license=open-s...
reply
LMArena isn't very useful as a benchmark, however I can vouch for the fact that GLM 5.1 is astonishingly good. Several people I know who have a $100/mo Claude Code subscription are considering cancelling it and going all in on GLM, because it's finally gotten (for them) comparable to Opus 4.5/6. I don't use Opus myself, but I can definitely say that the jump from the (imvho) previous best open weight model Kimi K2.5 to this is otherworldly — and K2.5 was already a huge jump itself!
reply
qwen3.5/3.6 (30B) works well locally with opencode
reply
Mind you, a 30B model (3B active) is not going to be comparable to Opus. There are open models that are near-SOTA but they are ~750B-1T total params. That's going to require substantial infrastructure if you want to use them agentically, scaled up even further if you expect quick real-time response for at least some fraction of that work. (Your only hope of getting reasonable utilization out of local hardware in single-user or few-users scenarios is to always have something useful cranking in the background during downtime.)
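To put the infrastructure point in numbers, a back-of-the-envelope sketch (assuming ~0.5 bytes per parameter, i.e. 4-bit quantization; real quants plus KV cache and activations add overhead on top):

```python
# Approximate weight memory for near-SOTA open models at 4-bit quantization.
# Assumption: 0.5 bytes/param; KV cache and activations need more on top.
def weights_gb(total_params_billions: float, bytes_per_param: float = 0.5) -> float:
    """Weight memory in GB for a quantized model of the given size."""
    return total_params_billions * bytes_per_param  # billions of params * bytes each = GB

for params_b in (750, 1000):
    print(f"{params_b}B params @ 4-bit ≈ {weights_gb(params_b):.0f} GB of weights alone")
```

At ~80GB of HBM per H100-class card, even the 4-bit weights alone span several cards before you account for KV cache.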
reply
For a business with ten or more engineers/people-using-ai, it might still make sense to set this up. For an individual though, I can’t imagine you’d make it through to positive ROI before the hardware ages out.
reply
It's hard to tell for sure, because the local inference engines/frameworks we have today are not all that capable yet. We have barely started exploring SSD offload, saving KV caches to storage for reuse, distributed inference across multi-GPU setups or over the network, specialty hardware such as NPUs, etc. All of these can reuse fairly ordinary, run-of-the-mill hardware.
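To illustrate the KV-cache-to-storage idea, here's a toy sketch (not any real engine's API; `compute_kv_cache` is a hypothetical stand-in for a prefill pass):

```python
import hashlib, pickle, tempfile
from pathlib import Path

CACHE_DIR = Path(tempfile.mkdtemp())

def compute_kv_cache(prompt: str) -> list:
    # Hypothetical stand-in for a real prefill pass; a real engine
    # would produce per-layer key/value tensors here.
    return [f"kv({tok})" for tok in prompt.split()]

def cached_prefill(prompt: str) -> tuple:
    """Reuse a KV cache from storage when the same prompt recurs."""
    key = hashlib.sha256(prompt.encode()).hexdigest()
    path = CACHE_DIR / f"{key}.pkl"
    if path.exists():                      # cache hit: skip the prefill entirely
        return pickle.loads(path.read_bytes()), True
    kv = compute_kv_cache(prompt)          # cache miss: run prefill, persist result
    path.write_bytes(pickle.dumps(kv))
    return kv, False

_, hit1 = cached_prefill("system prompt plus long shared context")
_, hit2 = cached_prefill("system prompt plus long shared context")
print(hit1, hit2)  # first call misses, second hits
```

The win is that a long shared prefix (system prompt, repo context) only pays the prefill cost once, even across restarts.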
reply
Since you need at least a few H100-class cards, I'd guess you need at least a few tens of coders to justify the cost.
reply
What near SOTA open models are you referring to?
reply
I'm backing up a big dataset onto tapes, so I wanted to automate it. I have an idle 64GB-VRAM setup in my basement, so I decided to experiment and tasked it with writing an LTFS implementation. LTFS is an open standard for tape filesystems, and there's an implementation in C that can be used as the baseline.

So far, over the last 2 days, Qwen 3.6 has created a functionally equivalent Go implementation that works against the flat-file backend. I'm extremely impressed.

reply
It is surprisingly competent. It's not Opus 4.6 but it works well for well structured tasks.
reply
I want to bump this more than just a +1 by recommending everyone try out OpenCode. It can still run on a Codex subscription so you aren’t in fully unfamiliar territory but unlocks a lot of options.
reply
The Codex TUI harness is also open source and you can use open models with it, so you can stay in even more familiar territory.
reply
pi-coding-agent (pi.dev) is also great. I've been using it with Gemma 4 and Qwen 3.6.
reply
The thing I dislike about OpenCode is its limited editor capabilities. It's also resource-intensive: for some reason, on a VM it chokes every 30 minutes and I have to discard all sessions, commits, etc.

I don't know if it's Bun-related, but in Task Manager it's almost always near the top of CPU usage. For me, Bun is not production-ready at all.

I wish Zed had something like BigPickle, which is free to use without limits.

reply
Is this sort of setup tenable on a consumer MBP or similar?
reply
Qwen’s 30B models run great on my MBP (M4, 48GB) but the issue I have is cooling - the fan exhaust is straight onto the screen, which I can’t help thinking will eventually degrade it, given the thermal cycling it would go through. A Mac Studio makes far more sense for local inference just for this reason alone.
reply
For a 30B model you want at least 20GB of VRAM, and a 24GB MBP can't quite allocate that much of its unified memory to the GPU by default. So you'd want at least a 32GB MBP.
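A quick sanity check on those numbers (assuming a ~4.5 bits/param quant like Q4_K_M, and macOS's default GPU wired-memory cap of roughly 75% of unified memory, tunable via the `iogpu.wired_limit_mb` sysctl):

```python
# Rough VRAM budget for a 30B model on a 24GB unified-memory Mac.
# Assumptions: ~4.5 bits/param quant; GPU can wire ~75% of RAM by default.
params = 30e9
bits_per_param = 4.5
weights_gb = params * bits_per_param / 8 / 1e9   # bits -> bytes -> GB
gpu_budget_24gb = 24 * 0.75
print(f"weights ≈ {weights_gb:.1f} GB")          # before KV cache/overhead
print(f"24GB MBP GPU budget ≈ {gpu_budget_24gb:.0f} GB")
```

The weights fit, but once you add KV cache and runtime overhead you're past the default GPU budget, which is why 32GB is the comfortable floor.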
reply
I have 24GB of VRAM available and haven't yet found a decent model or combination. The last one I tried was Qwen with Continue; I guess I need to spend more time on this.
reply
It's a MoE model so I'd assume a cheaper MBP would simply result in some experts staying on CPU? And those would still have a sizeable fraction of the unified memory bandwidth available.
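The arithmetic behind that intuition, using the 30B-total/3B-active split mentioned upthread (assuming 4-bit weights):

```python
# MoE: all experts must live somewhere, but only the active ones are
# read per token. Assumption: 4-bit quantization (0.5 bytes/param).
bytes_per_param = 0.5
total_gb  = 30e9 * bytes_per_param / 1e9   # full weights, VRAM + RAM combined
active_gb = 3e9  * bytes_per_param / 1e9   # weights actually touched per token
print(f"total {total_gb:.1f} GB, active per token ~{active_gb:.1f} GB")
```

So experts parked in system RAM still get read at system-memory bandwidth, which costs tokens/s but doesn't make the model unusable.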
reply
I haven't tried this myself yet, but you would still need enough regular (non-VRAM) RAM available for the CPU to offload to, right? This is a fully novice question; I have never tried it.
reply
Is there any model that practically compares to Sonnet 4.6 in code and vision and runs on home-grade (12G-24G) cards?
reply
I'm currently running a custom Gemma 4 26B MoE model on my 24GB M2... super fast, and it beat DeepSeek, ChatGPT, and Gemini in 3 different puzzles/code challenges I tested it on. The issue now is the low context... I can only do 2048 tokens with my VRAM... the gap is slowly closing on the frontier models
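For a sense of where the context memory goes, here's the standard KV-cache formula with hypothetical dimensions for a ~26B model (the real architecture's dims may differ):

```python
# KV cache grows linearly with context length.
# Hypothetical dims: 48 layers, 8 KV heads, head_dim 128, fp16 (2 bytes).
layers, kv_heads, head_dim, bytes_per_elem = 48, 8, 128, 2
per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem  # K and V
ctx = 2048
print(f"{per_token // 1024} KiB/token -> {per_token * ctx / 2**20:.0f} MiB at {ctx} tokens")
```

With the weights already eating most of 24GB, each extra few thousand tokens of cache has to fit in whatever's left; quantizing the KV cache (e.g. to 8-bit) roughly halves this cost.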
reply
The Mac Minis (probably 64GB RAM) are the most cost effective.
reply
How are you running it with opencode, any tips/pointers on the setup?
reply
GLM 5.1 via an infra provider. Running a competent coding-capable model yourself isn't viable unless your standards are quite low.
reply
What infra providers are there?
reply
There's DeepInfra. There's also OpenRouter where you can find several providers.
reply
I am using GLM 5.1 and MiniMax 2.7.
reply