Qwen3.6-Max-Preview: Smarter, Sharper, Still Evolving

upvote

Qwen3.6-Max-Preview: Smarter, Sharper, Still Evolving

(qwen.ai)

629 points

by mfiguiere20 hours ago |

upvote

by alex7o18 hours ago|

[-]

Ok I find it funny that people compare models and are like, opus 4.7 is SOTA and is much better etc, but I have used glm 5.1 (I assume this comes form them training on both opus and codex) for things opus couldn't do and have seen it make better code, haven't tried the qwen max series but I have seen the local 122b model do smarter more correct things based on docs than opus so yes benchmarks are one thing but reality is what the modes actually do and you should learn and have the knowledge of the real strengths that models posses. It is a tool in the end you shouldn't be saying a hammer is better then a wrench even tho both would be able to drive a nail in a piece of wood.

reply

upvote

by mikenew11 hours ago|

[-]

GLM 5.1 was the model that made me feel like the Chinese models had truly caught up. I cancelled my Claude Max subscription and genuinely have not missed it at all.

Some people seem to agree and some don't, but I think that indicates we're just down to your specific domain and usage patterns rather than the SOTA models being objectively better like they clearly used to be.

reply

upvote

by operatingthetan11 hours ago|

[-]

It seems like people can't even agree which SOTA model is best at any given moment anymore, so yeah I think it's just subjective at this point.

reply

upvote

by fwipsy11 hours ago|

[-]

Perhaps not even necessarily subjective, just performance is highly task-dependent and even variable within tasks. People get objectively different experiences, and assume one or another is better, but it's basically random.

reply

upvote

by easygenes6 hours ago|

[-]

Unless you're looking at something like a pass@100 benchmark, the benchmarks are confounded heavily by a likelihood of a "golden path" retrieval within their capabilities. This is on top of uncertainties like how well your task within a domain maps to the relevant test sets, as well as factors like context fullness and context complexity (heavy list of relevant complex instructions can weigh on capabilities in different ways than e.g. having a history where there's prior unrelated tasks still in context).

The best tests are your own custom personal-task-relevant standardized tests (which the best models can't saturate, so aiming for less than 70% pass rate in the best case).

All this is to say that most people are not doing the latter and their vibes are heavily confounded to the point of being mostly meaningless.

reply

upvote

by operatingthetan10 hours ago|

[-]

>just performance is highly task-dependent and even variable within tasks. People get objectively different experiences, and assume one or another is better, but it's basically random.

You are right that this is not exactly subjectivity, but I think for most people it feels like it. We don't have good benchmarks (imo), we read a lot about other people's experiences, and we have our own. I think certain models are going to be objectively better at certain tasks, it's just our ability to know which currently is impaired.

reply

upvote

by hamdingers9 hours ago|

[-]

And the subjectivity is bidirectional.

People judge models on their outputs, but how you like to prompt has a tremendous impact on those outputs and explains why people have wildly different experiences with the same model.

reply

upvote

by ulfw4 hours ago|

[-]

AI is a complete commodity

One model can replace another at any given moment in time.

It's NOT a winner-takes-all industry

and hence none of the lofty valuations make sense.

the AI bubble burst will be epic and make us all poorer. Yay

reply

upvote

by mettamage2 hours ago|

[-]

Hmm

Will try it out. Thanks for sharing!

reply

upvote

by abustamam10 hours ago|

[-]

What is your workflow? Do you use Cursor or another tool for code Gen?

reply

upvote

by mikenew4 hours ago|

[-]

I use Opencode, both directly and through Discord via a little bridge called Kimaki.

https://github.com/remorses/kimaki

reply

upvote

by LoganDark10 hours ago|

[-]

The value in Claude Code is its harness. I've tried the desktop app and found it was absolutely terrible in comparison. Like, the very nature of it being a separate codebase is already enough to completely throw off its performance compared to the CLI. Nuts.

reply

upvote

by deaux8 hours ago|

[-]

> The value in Claude Code is its harness

If this was the case then Anthropic would be in a very bad spot.

It's not, which is why people got so mad about being forced to use it rather than better third party harnesses.

Pi is better than CC as a harness in almost every respect.

reply

upvote

by enochthered7 hours ago|

[-]

Anthropic limiting Claude subs to Claude code is what pushed me away in the end because I wanted to keep using Pi.

reply

upvote

by strel0k16 hours ago|

[-]

Just sign up for an AWS account and use the Anthropic models through Bedrock which Pi can use.

reply

upvote

by deaux16 minutes ago|

[-]

What advantage are you saying this has compared to just directly going through the Anthropic provider? They are the same price.

reply

upvote

by seunosewa6 hours ago|

[-]

API costs are really high compared to subs.

reply

upvote

by adrianN6 hours ago|

[-]

Why use tricks to support a company that is hostile to your use case?

reply

upvote

by bizzletk6 hours ago|

[-]

Can you enumerate why?

reply

upvote

by deaux3 hours ago|

[-]

- Claude Code has repeatedly had enormous token wastage bugs. Its agent interactions are also inefficient. These are the cause of many of the reports of "single prompt blew through 5-hour quota" even though it's a reasonable prompt.

- It still lacks support for industry standards such as AGENTS.md

- Extremely limited customization

- Lots of bugs including often making it impossible to view pre-compaction messages inside Claude Code.

- Obvious one: can't easily switch between Claude and non-Claude models

- Resource usage

More than anything, I haven't found a single thing that Pi does worse. All of it is just straight up better or the same.

reply

upvote

by Mashimo4 hours ago|

[-]

I thought the desktop app used the cli app in the background?

reply

upvote

by bink-lynch10 hours ago|

[-]

I have been using GLM-5.1 with pi.dev through Ollama Cloud for my personal projects and I am very happy with this setup. I use pi.dev with Claude Sonnet/Opus 4.6 at work. Claude Code is great but the latest update has me compacting so much more frequently I could not stand it. I don't miss MCP tool calling when I am using pi.dev; it uses APIs just fine. I actually think GML-5.1 builds better websites than Claude Opus. For my personal projects I am building a full stack development platform and GLM-5.1 is doing a fantastic job.

reply

upvote

by zackify7 hours ago|

[-]

I'm using pi the same as you. However, I have an MCP I need to use and the popular extension for that support works fine for me.

Really liking pi and glm 5.1!

reply

upvote

by jadbox9 hours ago|

[-]

Why use ollama cloud versus like Openrouter?

reply

upvote

by bink-lynch3 hours ago|

[-]

The limits seem higher on Ollama Cloud to me than paying for API access. I don't have solid stats on that though. I have an OpenRouter account and the service I am creating is going to need to use that. I will have better measuring stick then.

reply

upvote

by zackify7 hours ago|

[-]

Recently it had great limits but this month I'm trying open router directly.

reply

upvote

by jxmesth16 hours ago|

[-]

The only reason I'm stuck with Claude and Chatgpt is because of their tool calling. They do have some pretty useful features like skills etc. I've tried using qwen and deepseek but they can't even output documents. How are you guys handling documents and excels with these tools? I'd love to switch tbh.

reply

upvote

by embedding-shape16 hours ago|

[-]

> I've tried using qwen and deepseek but they can't even output documents

What agent harness did you use? Usually, "write_file", "shell_exec" or similar is two of the first tools you add to an agent harness, after read_file/list_files. If it doesn't have those tools, unsure if you could even call it a agent harness in the first place.

reply

upvote

by jxmesth16 hours ago|

[-]

Sorry for the confusion, I was actually talking about their Web based chat. Since most of my work is governance and docs, I just use their Web chats and they just refuse to output proper documents like Claude or Chatgpt do.

reply

upvote

by embedding-shape16 hours ago|

[-]

Aha... Well, I let Codex (Claude Code would work too) manage/troubleshoot .xlsx files too, seems to handle it just fine (it tends to un-archive them and browse the resulting XML files without issues), seen it do similar stuff for .app and .docx files too so maybe give that a try with other harnesses/models too, they might get it :)

reply

upvote

by jxmesth5 hours ago|

[-]

Yeah, it's just way easier to do via the web/mobile app but I'll give using it via the CLI a try. Thanks :)

reply

upvote

by noduerme14 hours ago|

[-]

You're not giving an AI command line access to your work computer? How do you expect to keep up? /s

reply

upvote

by dymk14 hours ago|

[-]

You give it command line access in a VM...

reply

upvote

by ycui198610 hours ago|

[-]

i give it in real ubuntu, no vm, no docker. so long I don't ask it to organize files, it will behave. it has not screw me so far.

reply

upvote

by dymk9 hours ago|

[-]

Godspeed

reply

upvote

by koen_hendriks12 hours ago|

[-]

You mean a VM like the one that contains a 0day that can escape the sandbox that gets found every year at pwn2own?

reply

upvote

by enneff12 hours ago|

[-]

Presumably you’re also using a browser to view this web page. There have also been vulnerabilities in that. You have to draw a line somewhere.

reply

upvote

by andai11 hours ago|

[-]

I run mine as a separate unprivileged user. (No VM.) Am I pwned?

reply

upvote

by dymk9 hours ago|

[-]

Maybe, but the sort of 0days you're talking about aren't exploited in any meaningful way for almost all developers.

reply

upvote

by arcanemachiner7 hours ago|

[-]

"Seatbelts don't save the life of everyone who gets into an accident, so why bother wearing one?"

reply

upvote

by chillfox9 hours ago|

[-]

You can make a harness fully functional with just the "shell_exec" tool if you give it access to a linux/unix environment + playwright cli.

reply

upvote

by ecocentrik16 hours ago|

[-]

When was the last time you used Qwen models? Their 3.5 and 3.6 models are excellent with tool calling.

reply

upvote

by jxmesth16 hours ago|

[-]

I gave it a try a few weeks ago tbh, I'll give it another shot tho. I mainly use their Web chats since that's easier to use and previously, qwen, deepseek, kimi, all were unable to output proper docx files or use skills.

reply

upvote

by ecocentrik15 hours ago|

[-]

Try loading the models up in a coding harness like Claude Code. There's a few docx skills listed on Vercel's skill index.

https://skills.sh/tfriedel/claude-office-skills/docx

reply

upvote

by ycui19867 hours ago|

[-]

outputting docx files does not have much to do with model capability. it is about whether tool calling has be configured .

reply

upvote

by zrn90056 minutes ago|

[-]

You can just use Cline in VSCode to get most of the tooling you need - it works with all models. Including Xiaomi's new Mimo with 1m context window and blazing fast speed. It's much cheaper than Claude's biggest plan and with much, much more quota.

reply

upvote

by sscaryterry15 hours ago|

[-]

You can use GLM-5.1 with claude code directly, I use ccs, GLM-5.1 setup as plan, but goes via API key.

reply

upvote

by NobleLie11 hours ago|

[-]

Yep Claude Code CLI does A LOT (which is now confirmed even more)

reply

upvote

by ycui198610 hours ago|

[-]

qwen3.5 and qwen3.6 are both good at tool calling.

reply

upvote

by jwitthuhn16 hours ago|

[-]

I've been using qwen-code (the software, not to be confused with Qwen Code the service or Qwen Coder the model) which is a fork of gemini-cli and the tool use with Qwen models at least has been great.

reply

upvote

by estimator729214 hours ago|

[-]

You can use both codex and Claude CLI with local models. I used codex with Gemma4 and it did pretty well. I did get one weird session where the model got confused and couldn't decide which tools actually existed in its inventory, but usually it could use tools just fine.

reply

upvote

by Moosdijk16 hours ago|

[-]

I wonder why glm is viewed so positively.

Every time I try to build something with it, the output is worse than other models I use (Gemini, Claude), it takes longer to reach an answer and plenty of times it gets stuck in a loop.

reply

upvote

by pkulak16 hours ago|

[-]

I've been running Opus and GLM side-by side for a couple weeks now, and I've been impressed with GLM. I will absolutely agree that it's slow, but if you let it cook, it can be really impressive and absolutely on the level of Opus. Keep in mind, I don't really use AI to build entire services, I'm mostly using it to make small changes or help me find bugs, so the slowness doesn't bother me. Maybe if I set it to make a whole web app and it took 2 days, that would be different.

The big kicker for GLM for me is I can use it in Pi, or whatever harness I like. Even if it was _slightly_ below Opus, and even though it's slower, I prefer it. Maybe Mythos will change everything, but who knows.

reply

upvote

by tasuki15 hours ago|

[-]

> The big kicker for GLM for me is I can use it in Pi, or whatever harness I like.

Yes, but... isn't the same true for Opus and all the other models too?

reply

upvote

by slopinthebag15 hours ago|

[-]

Opus is about 7 times more expensive than GLM with API pricing. And since you can only use the Opus subscription plan in CC, you're essentially locked into API pricing for Pi and any other harness.

So you're either paying $1000's for Opus in Pi, or $30/month for GLM in Pi. If the results are mostly equivalent that's an easy choice for most of us.

reply

upvote

by tasuki14 hours ago|

[-]

Perhaps I'm being extremely daft: If the API is 7 times more expensive, then why is it $1000 vs $30? Or is there a GLM subscription one can use with Pi? Certainly not available in my (arguably outdated) Pi.

reply

upvote

by RussianCow14 hours ago|

[-]

I'm not the OP, but it's the latter. I'm currently using the "Lite" GLM subscription with OpenCode, for example. I'm not using it very heavily, but I haven't come close to hitting the limits, whereas I burned through my weekly limits with Claude very regularly.

reply

upvote

by bink-lynch9 hours ago|

[-]

I am using GLM-5.1 in pi.dev through Ollama Cloud. I am able to get by on the $20 plan. I use it a lot and the reset is hourly for sessions and weekly overall. This is the first week I got close to the limit before reset at about 85% used. I am probably using it about 4 hours a day on average 6 or 7 days per week.

reply

upvote

by girvo13 hours ago|

[-]

You can use GLM’s coding plan in Pi, just use the anthropic API instead of the OpenAI compatible one they give.

reply

upvote

by probst13 hours ago|

[-]

Or tell pi to add support for the coding plan directly. That gave me GLM-5.1 support in no time along with support for showing the remaining quota, etc, too.

It also compresses the context at around 100k tokens.

In case anyone is interested: https://github.com/sebastian/pi-extensions/tree/main/.pi/ext...

reply

upvote

by Mashimo16 hours ago|

[-]

I have used GLM 4.7, 5 and 5.1 now for about 3 month via OpenCode harness and I don't remember it every being stuck in a loop.

You have to keep it below ~100 000 token, else it gets funny in the head.

I only use it for hobby projects though. Paid 3 EUR per month, that is not longer available though :( Not sure what I will choose end of month. Maybe OpenCode Go.

reply

upvote

by Mashimo2 hours ago|

[-]

EDIT: Ok, now I tried GLM for the first time in the morning CET, and it was .. bad. The reasoning took 5 mintues for a very very small .html file going around in circles.

Evening CET experience for me is super smooth.

reply

upvote

by gck112 hours ago|

[-]

That's unfortunate. 70-80k tokens is roughly the point where I start wrapping up with giving agent required context even on the small to medium sized requests.

That would leave almost no tokens for actual work

reply

upvote

by chillfox1 hours ago|

[-]

GLM is the first open source model that actually worked for me, where I found the output ok.

And yes, sonnet/opus is better and what I use daily. But I wouldn’t be that upset if I had to drop down to GLM.

reply

upvote

by Akira136416 hours ago|

[-]

IDK about GLM but GPT 5.4 Extra High has been great when I've used it in the VS Code Copilot extension, I see no actual reason Opus should consume 3x more quota than it the way it does

reply

upvote

by spaceman_202014 hours ago|

[-]

I think it offers a very good tradeoff of cost vs competency

4.7 is better, but its also wildly expensive

reply

upvote

by slopinthebag16 hours ago|

[-]

You're probably just holding it wrong.

reply

upvote

by blurbleblurble5 hours ago|

[-]

Opus 4.6 was incredible but Opus 4.7 is genuinely frustrating to me so far. It's really sharp but can be so lazy. It's constantly telling me that we should save this for tomorrow, that it's time for bed (in the middle of the day), and very often quite sloppy and bold in its action. These adjustments are getting old. The next crop of open models seems ready to practically replace the big ones as sharp orchestrator agents.

reply

upvote

by chillfox1 hours ago|

[-]

I have never seen a model be “lazy” before (I have seen them go for minimal change). I have been using the models through the api with various agents and no custom system prompt.

So I am curious, how do people get these lazy outputs?

Is it by having one of those custom system prompts that basically tells the model to be disrespectful?

Or is it free tier?

Cheap plans?

reply

upvote

by enraged_camel1 hours ago|

[-]

I have seen some people complain about a new tendency where it can suggest wrapping up the current task even though it isn't done yet. I haven't seen it myself though.

reply

upvote

by szundi1 hours ago|

[-]

[dead]

reply

upvote

by ternaryoperator17 hours ago|

[-]

The models test roughly equal on benchmarks, with generally small differences in their scores. So, it’s reasonable to choose the model based on other criteria. In my case, I’d switch to any vendor that had a decent plugin for JetBrains.

reply

upvote

by ezekiel6815 hours ago|

[-]

Qwen3-Coder produced much better rust code (that utilized rust's x86-64 vectorized extensions) a few months ago than Claude Opus or Google Gemini could. I was calling it from harnesses such as the Zed editor and trae CLI.

I was very impressed.

reply

upvote

by gck112 hours ago|

[-]

I think claude in general, writes very lazy, poor quality code, but it writes code that works in fewer iterations. This could be one of the reasons behind it's popularity - it pushes towards the end faster at all costs.

Every time codex reviews claude written rust, I can't explain it, but it almost feels like codex wants to scream at whoever wrote it.

reply

upvote

by lambda9 hours ago|

[-]

Their latest, Qwen3.6 35B-A3B is quite capable, and fast and small enough I don't really feel constrained running it locally. Some of the others that I've run that seem reasonably good, like Gemma 4 31B and Qwen3.5 122B-A10B just feel a bit too slow, or OOM my system too often, or run up on cache limits so spend a lot of time re-processing history. But the latest Qwen3.6 is both quite strong, and lightweight enough that it feels usable on consumer hardware.

reply

upvote

by justincormack15 hours ago|

[-]

Codex is pretty good at Rust with x86 and arm intrinsics too, it replaced a bunch of hand written C/assembly code I was using. I will try Qwen and Kimi on this kind of task too.

reply

upvote

by sirnicolaz15 hours ago|

[-]

Consider that SWE benchmarking is mainly done with python code. It tells something

reply

upvote

by cornedor17 hours ago|

[-]

I tried GLM and Qwen last week for a day. And some issues it could solve, while some, on surface relatively easy, task it just could not solve after a few tries, that Opus oneshotted this morning with the same prompt. It’s a single example ofcourse, but I really wanted to give it a fair try. All it had to do was create a sortable list in Magento admin. But on the other hand, GLM did oneshot a phpstorm plugin

reply

upvote

by dev_l1x_be16 hours ago|

[-]

Do you use Opus through the API or with subscription? Did you use OpenCode or Code?

reply

upvote

by cornedor15 hours ago|

[-]

Opus trough Claude Code, the Chinese models trough OpenCode Go, which seems like a great package to test them out.

reply

upvote

by odie553314 hours ago|

[-]

If you showed me code from GLM 5.1, Opus 4.6, and Kimi K2.6, my ranking for best model would be highly random.

reply

upvote

by mkhalil3 hours ago|

[-]

Not to mention, that Opus cost orders of magnitude more money. These are VERY impressive and usage.

FAANGS love to give away money to get people addicted to their platforms, and even they, the richest companies in the world, are throttling or reducing Opus usage for paying members, because even the money we pay them doesn't cover it.

Meanwhile, these are usable on local deployments! (and that's with the limited allowance our AI overlords afford us when it comes to choices for graphics cards too!)

reply

upvote

by FlyingSnake17 hours ago|

[-]

I tried GLM5.1 last week after reading about it here. It was slow as molasses for routine tasks and I had to switch back to Claude. It also ran out of 5H credit limit faster than Claude.

reply

upvote

by bensyverson17 hours ago|

[-]

If you view the "thinking" traces you can see why; it will go back and forth on potential solutions, writing full implementations in the thinking block then debating them, constantly circling back to points it raised earlier, and starting every other paragraph with "Actually…" or "But wait!"

reply

upvote

by nothinkjustai17 hours ago|

[-]

I see this with Opus too.

reply

upvote

by girvo13 hours ago|

[-]

Indeed. And that’s with Anthropic hiding reading traces unlike these other comparisons.

reply

upvote

by FlyingSnake17 hours ago|

[-]

> "Actually…" or "But wait!"

You’re absolutely right!

Jokes apart, I did notice GLM doing these back and forth loops.

reply

upvote

by tonyarkles16 hours ago|

[-]

I was watching Qwen3.6-35B-A3B (locally) doing the same dance yesterday. It eventually finished and had a reasonable answer, but it sure went back and forth on a bunch of things I had explicitly said not to do before coming to a conclusion. At least said conclusion was not any of the things I'd said not to do.

reply

upvote

by Lerc15 hours ago|

[-]

That is essentially what the reasoning reinforcement training does. It is getting the model to say things that are more likely to result in the correct final answer. Everything it does in between doesn't necessarily need to be valid argument to produce the answer. You can think of it as filling the context with whatever is needed to make the right answer come out next. Valid arguments obviously help. but so might expressions of incorrect things that are not obviously untrue to the model until it sees them written out. The What's The Magic Word paper shows how far that could go. If the policy model managed to learn enough magic words it would be theoretically possible to end up with an LLM that spouts utter gibberish until delivering the correct answer seemingly out of the blue.

reply

upvote

by tonyarkles14 hours ago|

[-]

That's pretty cool, thanks for the extra context! (pardon the... not even pun I guess)

Also, thanks for pointing me at that specific paper; I spend a lot more of my life closer to classical control theory than ML theory so it's always neat to see the intersection of them. My unsubstantiated hypothesis is that controls & ML are going to start getting looked at more holistically, and not in the way I normal see it (which is "why worry about classical control theory, just solve the problem with RL"). Control theory is largely about steering dynamic systems along stable trajectories through state space... which is largely what iterative "fill in the next word" LLM models are doing. The intersection, I hope, will be interesting and add significant efficiency.

reply

upvote

by nothinkjustai17 hours ago|

[-]

Z.ai’s cloud offering is poor, try it with a different provider.

reply

upvote

by complexworld5 hours ago|

[-]

could you add some context for why you think it's poor?

reply

upvote

by dev_l1x_be16 hours ago|

[-]

Benchmarking is grossly misleading. Claude’s subscription with Code would not score this high on the benchmarks because how they lobotomized agentic coding.

reply

upvote

by solomatov16 hours ago|

[-]

>but I have seen the local 122b model do smarter more correct things based on docs than opus

Could you please share more about this

reply

upvote

by alex7o13 hours ago|

[-]

Maybe a bit misleading. I have used in in two places.

One Is for local opencode coding and config of stuff the other is for agent-browser use and for both it did better (opus 4.6) for the thing I was testing atm. The problem with opus at the moment I tired it was overthinking and moving itself sometimes I the wrong direction (not that qwen does overthink sometimes). However sometimes less is more - maybe turning thinking down on opus would have helped me. Some people said that it is better to turn it of entirely when you start to impmenent code as it already knows what it needs to do it doesn't need more distraction.

Another example is my ghostty config I learned from queen that is has theme support - opus would always just make the theme in the main file

reply

upvote

by OtomotO18 hours ago|

[-]

Many people averted religion (which I can get behind with), but have never removed the dogmatic thinking that lay at its root.

As so many things these days: It's a cult.

I've used Claude for many months now. Since February I see a stark decline in the work I do with it.

I've also tried to use it for GPU programming where it absolutely sucks at, with Sonnet, Opus 4.5 and 4.6

But if you share that sentiment, it's always a "You're just holding it wrong" or "The next model will surely solve this"

For me it's just a tool, so I shrug.

reply

upvote

by balls18718 hours ago|

[-]

> I've used Claude for many months now. Since February I see a stark decline in the work I do with it.

I find myself repeating the following pattern: I use an AI model to assist me with work, and after some time, I notice the quality doesn't justify the time investment. I decide to try a similar task with another provider. I try a few more tests, then decide to switch over for full time work, and it feels like it's awesome and doing a good job. A few months later, it feels like the model got worse.

reply

upvote

by runarberg17 hours ago|

[-]

I wonder about this. I see two obvious possibilities (if we ignore bias):

1. The models are purposefully nerfed, before the release of the next model, similar to how Apple allegedly nerfed their older phones when the next model was out.

2. You are relying more and more on the models and are using your talent less and less. What you are observing is the ratio of your vs. the model’s work leaning more and more to the model’s. When a new model is released, it produces better quality code then before, so the work improves with it, but your talent keeps deteriorating at a constant rate.

reply

upvote

by ehnto17 hours ago|

[-]

I definitely find your last point is true for me. The more work I am doing with AI the more I am expecting it to do, similar to how you can expect more over time from a junior you are delegating to and training. However the model isn't learning or improving the same way, so your trust is quickly broken.

As you note, the developer's input is still driving the model quite a bit so if the developer is contributing less and less as they trust more, the results would get worse.

reply

upvote

by tonyarkles16 hours ago|

[-]

> However the model isn't learning or improving the same way, so your trust is quickly broken.

One other failure mode that I've seen in my own work while I've been learning: the things that you put into AGENTS.md/CLAUDE.md/local "memories" can improve performance or degrade performance, depending on the instructions. And unless you're actively quantitatively reviewing and considering when performance is improving or degrading, you probably won't pick up that two sentences that you added to CLAUDE.md two weeks ago are why things seem to have suddenly gotten worse.

> similar to how you can expect more over time from a junior you are delegating to and training

That's the really interesting bit. Both Claude and Codex have learned some of my preferences by me explicitly saying things like "Do not use emojis to indicate task completion in our plan files, stick to ASCII text only". But when you accidentally "teach" them something that has a negative impact on performance, they're not very likely to push back, unlike a junior engineer who will either ignore your dumb instruction or hopefully bring it up.

> As you note, the developer's input is still driving the model quite a bit so if the developer is contributing less and less as they trust more, the results would get worse.

That is definitely a thing too. There have been a few times that I have "let my guard down" so to speak and haven't deeply considered the implications of every commit. Usually this hasn't been a big deal, but there have been a few really ugly architectural decisions that have made it through the gate and had to get cleaned up later. It's largely complacency, like you point out, as well as burnout trying to keep up with reviewing and really contemplating/grokking the large volume of code output that's possible with these tools.

reply

upvote

by svnt17 hours ago|

[-]

Your version of the last point is a bit softer I think — parent was putting it down to “loss of talent” but yours captures the gaps vs natural human interaction patterns which seems more likely, especially on such short timescales.

reply

upvote

by runarberg17 hours ago|

[-]

I confusingly say both. First I say that the ratio of work coming from the model is increasing, and when I am clarifying I say “your talent keeps deteriorating”. You correctly point out these are distinct, and maybe this distinction is important, although I personally don‘t think so. The resulting code would be the same either way.

Personally I can see the case for both interpretation to be true at the same time, and maybe that is precisely why I confused them so eagerly in my initial post.

reply

upvote

by rescbr14 hours ago|

[-]

I don’t think the providers intentionally nerf the models to make the new one look better. It’s a matter of them being stingy with infrastructure, either by choice to increase profit and/or sheer lack of resources to keep n+1 models deployed in parallel without deprecating older ones when a new one is released.

I’d prefer providers to simply deprecate stuff faster, but then that would break other people’s existing workflows.

reply

upvote

by flux312516 hours ago|

[-]

Point 2 is so true, I definitely find myself spending more time reading code vs writing it. LLMs can teach you a lot, but it's never the same as actually sitting down and doing it yourself.

reply

upvote

by e12e17 hours ago|

[-]

I think it might have to do with how models work, and fundamental limits with them (yes, they're stochastic parrots, yes they confabulate).

Newer (past two years?) models have improved "in detail" - or as pragmatic tools - but they still don't deserve the anthropomorphism we subject them to because they appear to communicate like us (and therefore appear to think and reason, like us).

But the "holes" are painted over in contemporary models - via training, system prompts and various clever (useful!) techniques.

But I think this leads us to have great difficulty spotting the weak spots in a new, or slightly different model - but as we get to know each particular tool - each model - we get better at spotting the holes on that model.

Maybe it's poorly chosen variable names. A tendency to write plausible looking, plausibly named, e2e tests that turns out to not quite test what they appear to test at first glance. Maybe there's missing locking of resources, use of transactions, in sequencial code that appear sound - but end up storing invalid data when one or several steps fail...

In happy cases current LLMs function like well-intentioned junior coders enthusiasticly delivering features and fixing bugs.

But in the other cases, they are like patholically lying sociopaths telling you anything you want to hear, just so you keep paying them money.

When you catch them lying, it feels a bit like a betrayal. But the parrot is just tapping the bell, so you'll keep feeding it peanuts.

reply

upvote

by taurath18 hours ago|

[-]

I agree - the problem is it’s hard to see how people who say they’re using it effectively actually are using it, what they’re outputting, and making any sort of comparison on quality or maintainability or coherence.

In the same way, it’s hard to see how people who say they’re struggling are actually using it.

There’s truth somewhere in between “it’s the answer to everything” and “skill issue”. We know it’s overhyped. We know that it’s still useful to some extent, in many domains.

reply

upvote

by balls18717 hours ago|

[-]

Well summarized.

We're also seeing that the people up top are using this to cull the herd.

reply

upvote

by psychoslave18 hours ago|

[-]

What is it that is dogma free? If one goes hardcore pyrrhonism, doubting that there is anything currently doubting as this statement is processed somehow, that is perfectly sound.

At some point the is a need to have faith in some stable enough ground to be able to walk onto.

reply

upvote

by Wolfbeta16 hours ago|

[-]

Who controls that need for you?

reply

upvote

by ecshafer18 hours ago|

[-]

All people think dogmatically. The only difference is what the ontological commitments and methaphysical foundations are. Take out God and people will fit politics, sports teams, tools, whatever in there. Its inescapable.

reply

upvote

by smallmancontrov16 hours ago|

[-]

All people think dogmatically, but religion does not prevent people from acting dogmatically in politics, sports, etc. It just doesn't. It never did.

Under normal circumstances I'd consider this a nit and decline to pick it, but the number of evangelists out there arguing the equivalent of "cure your alcohol addiction with crystal meth!" is too damn high.

reply

upvote

by bensyverson17 hours ago|

[-]

Allow me to introduce you to Buddhism

reply

upvote

by ecshafer17 hours ago|

[-]

Elaborate. Buddhism is going to have the same epistemological issues as anything, since its a human consciousness issue.

reply

upvote

by bensyverson15 hours ago|

[-]

> since its a human consciousness issue

I'd encourage you to check it out for yourself. It's certainly possible to be a dogmatic Buddhist, but one of the foundational beliefs of Buddhism is that the type of dogmatic attachment you're describing is avoidable. It's not easy, but that's why you meditate.

reply

upvote

by tauroid16 hours ago|

[-]

https://en.wikipedia.org/wiki/Prat%C4%ABtyasamutp%C4%81da

reply

upvote

by svnt17 hours ago|

[-]

Which one?

reply

upvote

by bensyverson17 hours ago|

[-]

Zen

reply

upvote

by svnt16 hours ago|

[-]

The Western Zen? In my experience it is downgraded from being a religion to being a system of practice which relieves it of the broader Mahayana cosmology. But I would suggest the dogma is less obvious but still there, often just somewhere else, such as in its own limitations, or in a philosophical container at a higher level such as scientism.

reply

upvote

by bensyverson15 hours ago|

[-]

All Zen is about releasing those attachments. Granted it's pretty hard, because if you succeed, you're enlightened.

East, West, Religion, Practice… From a Zen perspective, you're just troubling your mind with binaries and conflict.

reply

upvote

by svnt14 hours ago|

[-]

Ah and there is the dogma -- the otherness of the enlightened.

The binaries still functionally exist. I see a lot of value in reflective practices. At the same time it seems unlikely to me that the point of existing is to not trouble your mind.

reply

upvote

by bensyverson13 hours ago|

[-]

There's a saying in Zen: if you meet the buddha on the road, kill him. The point being, the very exaltation of enlightenment is an impediment.

If Buddhism can be said to have a goal, it is to reduce suffering (including your own), so troubling your own mind is indeed something it can help with. The point of existence would be something interesting to meditate on. If you discover it, let us all know!

reply

upvote

by svnt6 hours ago|

[-]

This dancing between positions is all very defensible and if the path is currently working for you, more power to you.

Dogma, like the binaries, still functionally exists, whatever the narrative. If you can’t admit that, that might also be something interesting to meditate on.

Say you have eliminated all suffering. How many versions of that world exist? How many of them are true, beautiful, and good? See how, in order to evaluate the success or failure of Buddhism, we have to move beyond “eliminate suffering” to a higher value standard?

reply

upvote

by OtomotO17 hours ago|

[-]

Dogmatism is a spectrum and for too many people it's on the animal side of the scale.

reply

upvote

by taneq17 hours ago|

[-]

I wonder to what degree it depends on how easy you find coding in general. I find for the early steps genAI is great to get the ball rolling, but rapidly it becomes more work to explain what it did wrong and how to fix it (and repeat until it does so) than to just fix the code myself.

reply

upvote

by slopinthebag9 hours ago|

[-]

Yes, this and also taste. What might be perfectly fine for one developer is an abomination for another who can spot the problems with it.

I think in every domain, the better you are the less useful you find AI.

reply

upvote

by redsocksfan4516 hours ago|

[-]

[dead]

reply

upvote

by seanw26515 hours ago|

[-]

Kimi K2.6 also released today. I think it's fair to compare the two models.

Qwen appears to be much more expensive:

- Qwen: $1.3 in / $7.8 out

- Kimi: $0.95 in / $4 out

--

The announcement posts only share two overlapping benchmark results. Qwen appears to score slightly lower on SWE-Bench Pro and Terminal-Bench 2.0.

Qwen:

- Teminal-Bench 2.0: 65.4

- SWE-Bench Pro: 57.3

Kimi:

- Terminal-Bench 2.0: 66.8

- SWE-Bench Pro: 58.6

--

Different models have different strong suits, and benchmarks don't cover everything. But from a numbers perspective, Kimi looks much more appealing.

reply

upvote

by archon8109 hours ago|

[-]

I wonder if this means a better Cursor Composer model update is coming, since it builds on top of Kimi K2.

reply

upvote

by mchusma13 hours ago|

[-]

i think as the pricing has gone up on the Chinese models it has made them less appealing, and with the introduction of Gemma-4 not many are at the pareto frontier (also in my experience, not just the stats): https://arena.ai/leaderboard/text/overall?viewBy=plot

reply

upvote

by ninjahawk119 hours ago|

[-]

The way to develop in this space seems to be to give away free stuff, get your name out there, then make everything proprietary. I hope they still continue releasing open weights. The day no one releases open weights is a sad day for humanity. Normal people won’t own their own compute if that ever happens.

reply

upvote

by culi18 hours ago|

[-]

I think that's an overgeneralization. We've seen all the American models be closed and proprietary from the start. Meanwhile the non-American (especially the Chinese ones) have been open since the start. In fact they often go the opposite direction. Many Chinese models started off proprietary and then were later opened up (like many of the larger Qwen models)

reply

upvote

by 38362936482 minutes ago|

[-]

GPT started off open? They just closed before anyone else even joined the space

reply

upvote

by robot_jesus18 hours ago|

[-]

> We've seen all the American models be closed and proprietary from the start

What about Gemma and Llama and gpt-oss, not to mention lots of smaller/specialized models from Nvidia and others?

I would never argue that China isn't ahead in the open weights game, of course, but it's not like it's "all" American models by any stretch.

reply

upvote

by walthamstow18 hours ago|

[-]

gpt-oss is good but I haven't heard anything about an update. It seems like one and done, to shut up people complaining about non-Open AI

reply

upvote

by InkCanon4 hours ago|

[-]

The more accurate version is only Chinese companies (plus Facebook briefly) really open source their frontier models. The rest are non frontier. They are either older or specialized for something.

reply

upvote

by 1dom3 hours ago|

[-]

It's all openwashing, all of the ones you listed at somepoint have expressed how important and valuable open weights and locally usable models are. Every single one of them has then increasingly focused and pushed closed, proprietary or cloud usable only options since saying/doing that.

I'm annoyed at myself, because I thought/hoped/praised chinese AI when they were opening up as Llama was closing, but Qwen looks to be doing the same playbook here as Llama/Meta, Gemma/Google and OpenAI/gpt-oss.

reply

upvote

by embedding-shape18 hours ago|

[-]

> We've seen all the American models be closed and proprietary from the start.

Most*.

OpenAI, contrary to popular belief, actually used to believe in open research and (more or less) open models. GPT1 and GPT2 both were model+code releases (although GPT2 was a "staged" release), GPT3 ended up API-only.

reply

upvote

by culi18 hours ago|

[-]

That's fair but those days seem so long gone now.

Also the Chinese models aren't following a typical American SaaS playbook which relies on free/cheap proprietary software for early growth. They are not just publishing their weights but also their code and often even publishing papers in Open Access journals to explicitly highlight what methods and advancements were made to accomplish their results

reply

upvote

by jfoster6 hours ago|

[-]

> those days seem so long gone now.

Well, Musk v OpenAI kicks off in one week from now with the objective of forcing them back to their roots. A jury will be deciding whether a nonprofit accepting $50m - $100m of donations and then discarding their mission for an IPO is OK or not. Should be interesting.

reply

upvote

by zozbot23418 hours ago|

[-]

The Nvidia Nemotron models are recent, and of course the Gemma 4 series from Google.

reply

upvote

by tasuki15 hours ago|

[-]

Any idea why they do that?

reply

upvote

by taneq17 hours ago|

[-]

gasp Science!

reply

upvote

by zozbot23418 hours ago|

[-]

OpenAI has released their GPT-OSS series more recently.

reply

upvote

by magicalhippo15 hours ago|

[-]

Recently, more like 20 years ago in LLM-years.

It's a good model though, would be nice with a refresh.

reply

upvote

by 18 hours ago|

[-]

deleted

reply

upvote

by visarga19 hours ago|

[-]

I think it is in the interest of chip makers to make sure we all get local models

reply

upvote

by qalmakka18 hours ago|

[-]

I think they're in a win-win situation. Big AI companies would love to see local computing die in favour of the cloud because they are well aware the moment an open model that can run on non ludicrous consumer hardware appears, they're screwed. In this situation Nvidia, AMD and the like would be the only ones profiting from it - even though I'm not convinced they'd prefer going back to fighting for B2C while B2B Is so much simpler for them

reply

upvote

by zozbot23418 hours ago|

[-]

If you want to run AI models at scale and with reasonably quick response, there's not many alternatives to datacenter hardware. Consumer hardware is great for repurposing existing "free" compute (including gaming PCs, pro workstations etc. at the higher end) and for basic insurance against rug pulls from the big AI vendors, but increased scale will probably still bring very real benefits.

reply

upvote

by qalmakka18 hours ago|

[-]

Currently, yes. But I don't find it hard to imagine that in a while we could get reasonably light open models with a level of reasoning similar to current opus, for instance. In such a scenario how many people would opt to pay for a way more expensive cloud subscription? Especially since lots of people are already not that interested in paying for frontier models nowadays where it makes sense. Unless keep on getting a constant, never ending stream of improvements we're basically bound to get to a point where unless you really need it you are ok with the basic, cheaper local alternative you don't have to pay for monthly.

reply

upvote

by zozbot23418 hours ago|

[-]

I think average users are already okay with the reasoning level they'd get with current open models. But the big AI firms have pivoted their frontier models towards the enterprise: coding and research, as opposed to general chat. And scale is quite important for these uses, ordinary pro hardware is not enough.

reply

upvote

by twoodfin18 hours ago|

[-]

This is really just a question of product design meeting the technology.

Today, lots of integer compute happens on local devices for some purposes, and in the cloud for others.

Same is already true for matmul, lots of FLOPS being spent locally on photo and video processing, speech to text, …

No obvious reason you wouldn’t want to specialize LLM tasks similarly, especially as long-running agents increasingly take over from chatbots as the dominant interaction architecture.

reply

upvote

by lelanthran5 hours ago|

[-]

> If you want to run AI models at scale and with reasonably quick response, there's not many alternatives to datacenter hardware.

Right now, certainly. Things change. What was a datacenter rack yesterday could be a laptop tomorrow.

reply

upvote

by BobbyJo18 hours ago|

[-]

At a consistent amount of usage, datacenters are at least an order of magnitude more hardware efficient. I'm sure Nvidia and AMD would be fine fighting for B2C if it meant volume would be 10+x.

Now, given they can't satisfy current volume, they are forced to settle for just having crazy margins.

reply

upvote

by qalmakka18 hours ago|

[-]

The problem with B2C is that you need to have leverage of some kind (more demanding applications, planned obsolescence, ...) in order to get people to keep on buying your product. The average consumer may simply consider themselves satisfied with their old product they already own and only replace it when it breaks down. On the contrary, with the cloud you can keep people hooked on getting the latest product whether they need it or not, and get artificial demand from datacentres and such.

reply

upvote

by try-working11 hours ago|

[-]

Future upgrade cycles on phones and laptops, PCs, will be driven by SOCs that embed some type of ASIC that run a specific model. Every 6 months there will be a new, better version to upgrade to, which will require a new device. This is how Apple will be able to reduce cycles from 3 years to 6-12 months.

reply

upvote

by BobbyJo16 hours ago|

[-]

I think businesses running datacenters are much less likely to frivolously buy the latest GPUs with no functional incentive than general consumers are...

reply

upvote

by ycui198610 hours ago|

[-]

There are also many Chines AI-target GPU/NPU producers. You can get a hold of some boards on taobao.com. They are usable in some way.

No, nVidia and AMD are not the only ones benefiting.

reply

upvote

by zozbot23419 hours ago|

[-]

Definitely. Many big hardware firms are directly supporting HuggingFace for this very reason.

reply

upvote

by ninjahawk119 hours ago|

[-]

True, chip companies have the opposite mindset, Nvidia is making their own open weights I believe

reply

upvote

by elorant18 hours ago|

[-]

This is obviously a strategic move at a national level. Keep publishing competing free models to erode the moat western companies could have with their proprietary models. As long as the narrative serves China there will be no turn to proprietary models.

reply

upvote

by Barrin928 hours ago|

[-]

>This is obviously a strategic move at a national level.

no it isn't. That's the kind of thing people say who've never worked in the Chinese software ecosystem. It's how the Chinese internet has worked for 20+ years. The Chinese market is so large and competition is so rabid that every company basically throws as much free stuff at consumers as they can to gain users. Entrepreneurs don't think about "grand strategic moves at the national level" while they flip through their copies of the Art of War and Confucius lol

reply

upvote

by elorant1 hours ago|

[-]

If this was true then they’d build services around those models and provide those for free or vastly cheaper than western competition. But that’s not what they’re doing. Instead they’re giving away the entire model for free. And by the way, Qwen isn’t build from some random entrepreneur who’s trying to solve the cold start problem, but from Alibaba which is a fucking behemoth. And surprisingly of course none of these models answer uncomfortable questions about China’s past. Because sure enough, the first thing any entrepreneur would think is to protect their government and their history. Sure, happens all the time, no state interference here, move on.

reply

upvote

by stingraycharles8 hours ago|

[-]

That has been a viable commercial strategy for most modern, funded businesses. Capture market share at a loss, then once name is established turn on the profit.

reply

upvote

by try-working11 hours ago|

[-]

Exactly. Open source is a commercial strategy for Chinese labs. They have no other effective way of marketing their models and inference services: https://try.works/writing-1#why-chinese-ai-labs-went-open-an...

reply

upvote

by baq19 hours ago|

[-]

Always has been, it’s literally saas; the slight difference is that the lowest tier subscriptions at the frontier labs are basically free trials nowadays, too

reply

upvote

by Zavora18 hours ago|

[-]

Its the new freeware model!

reply

upvote

by CamperBob218 hours ago|

[-]

I'm a little more optimistic than that. I suspect that the open-weight models we already have are going to be enough to support incremental development of new ones, using reasonably-accessible levels of compute.

The idea that every new foundation model needs to be pretrained from scratch, using warehouses of GPUs to crunch the same 50 terabytes of data from the same original dumps of Common Crawl and various Russian pirate sites, is hard to justify on an intuitive basis. I think the hard work has already been done. We just don't know how to leverage it properly yet.

reply

upvote

by thesz18 hours ago|

[-]

Change layer size and you have to retrain. Change number of layers and you have to retrain. Change tokenization and you have to retrain.

reply

upvote

by altruios17 hours ago|

[-]

Hopefully we will find a way to make it so that making minor changes don't require a full retrain. Training how to train, as a concept, comes to mind.

reply

upvote

by CamperBob217 hours ago|

[-]

And yet the KL divergence after changing all that stuff remains remarkably similar between different models, regardless of the specific hyperparameters and block diagrams employed at pretraining time. Some choices are better, some worse, but they all succeed at the game of next-token prediction to a similar extent.

To me, that suggests that transformer pretraining creates some underlying structure or geometry that hasn't yet been fully appreciated, and that may be more reusable than people think.

Ultimately, I also doubt that the model weights are going to turn out to be all that important. Not compared to the toolchains as a whole.

reply

upvote

by thesz15 hours ago|

[-]

That "underappreciated underlying structure or geometry" can be just an artifact of the same tokenization used with different models.

Tokenization breaks up collocations and creates new ones that are not always present in the original text as it was. Most probably, the first byte pair found by simple byte pair encoding algorithm in enwik9 will be two spaces next to each other. Is this a true collocation? BPE thinks so. Humans may disagree.

What does concern me here is that it is very hard to ablate tokenization artifacts.

reply

upvote

by dTal17 hours ago|

[-]

None of that is true, at least in theory. You can trivially change layer size simply by adding extra columns initialized as 0, effectively embedding your smaller network in a larger network. You can add layers in a similar way, and in fact LLMs are surprisingly robust to having layers added and removed - you can sometimes actually improve performance simply by duplicating some middle layers[0]. Tokenization is probably the hardest but all the layers between the first and last just encode embeddings; it's probably not impossible to retrain those while preserving the middle parts.

[0] https://news.ycombinator.com/item?id=47431671 https://news.ycombinator.com/item?id=47322887

reply

upvote

by thesz14 hours ago|

[-]

You took a simple path, embedding smaller into larger. What if you need to reduce number of layers and/or width of hidden layers? How will you embed larger into smaller? As for the "addition of same layers" - would the process of "layers to add" selection be considered training?

What if you still have to obtain the best result possible for given coefficient/tokenization budget?

I think that my comment express general case, while yours provide some exceptions.

reply

upvote

by dTal2 hours ago|

[-]

The general case is that our own current relative ignorance on the best way to use and adapt pretrained weights is a short-lived anomaly caused by an abundance of funding to train models from scratch, a rapid evolution of training strategies and architectures, and a mad rush to ship hot new LLMs as fast as possible. But even as it is, the things you mentioned are not impossible, they are easy, and we are only going to get better at them.

>What if you need to reduce number of layers

Delete some.

> and/or width of hidden layers?

Randomly drop x% of parameters. No doubt there are better methods that entail distillation but this works.

> would the process of "layers to add" selection be considered training?

Er, no?

> What if you still have to obtain the best result possible for given coefficient/tokenization budget?

We don't know how to get "the best result possible", or even how to define such a thing. We only know how to throw compute at an existing network to get a "better" network, with diminishing returns. Re-using existing weights lowers the amount of compute you need to get to level X.

reply

upvote

by andriy_koval15 hours ago|

[-]

there is evidence it is useful in some cases, but obviously no evidence it is enough if you chase to beat SOTA.

reply

upvote

by pduggishetti18 hours ago|

[-]

I do not think it's common crawl anymore, its common crawl++ using paid human experts to generate and verify new content, weather its code or research.

I believe US is building this off the cost difference from other countries using companies like scale, outlier etc, while china has the internal population to do this

reply

upvote

by testbjjl19 hours ago|

[-]

Any reason for them to do this other than altruism? I don’t think this can be regulated.

reply

upvote

by Rohansi18 hours ago|

[-]

Bake ads into them.

reply

upvote

by WarmWash18 hours ago|

[-]

The Chinese state wants the world using their models.

People think that Chinese AI labs are just super cool bros that love sharing for free.

The don't understand it's just a state sponsored venture meant to further entrench China in global supply and logistics. China's VCs are Chinese banks and a sprinkle of "private" money. Private in quotes because technically it still belongs to the state anyway.

China doesn't have companies and government like the US. It just has government, and a thin veil of "company" that readily fool westerners.

reply

upvote

by subw00f18 hours ago|

[-]

As opposed to the US, which just has companies and a thin veil of “government”.

reply

upvote

by culi18 hours ago|

[-]

Also many of these Chinese companies aren't just opening their weights. They are open sourcing their code AND publishing detailed research papers alongside them to reveal how they accomplished what they accomplished.

That's very different from an American SaaS model which relies of free but proprietary software for early growth

reply

upvote

by zozbot23418 hours ago|

[-]

I'm not sure how local AI models are meant to "entrench China in global supply and logistics". The two areas have nothing to do with one another. You can easily run a Chinese open model on all-American hardware.

reply

upvote

by WarmWash18 hours ago|

[-]

They are building a pipeline, and the goal is to get people in the door.

If you forever stand at the entrance eating the free samples, that's fine, they don't care. Other people are going through the door and you are still consuming what they feed you. Doesn't mean it's going to be bad or evil, but they are staking their territory of control.

reply

upvote

by zozbot23418 hours ago|

[-]

Oh for sure, they're getting a whole lot of Chinese people and other non-Westerners through the door already - mostly, the people who are being ignored or even blocked outright by the big Western labs. That's territory we purposely abandoned, and they're going to control it by default.

reply

upvote

by devilsdata12 hours ago|

[-]

I'm Aussie. Please explain to me; why should I care whether Chinese SOEs or the US tech companies are winning? Neither have my best interests at heart.

reply

upvote

by jillesvangurp18 hours ago|

[-]

Like with nuclear technology, it's not healthy for only one country to dominate AI. The cat is already out of the bag and many countries now have the ability to train and run models. Silicon Valley has bootstrapped this space. But it should be noted that they are using AI talent from all over the world and it was sort of inevitable that this technology would get around. Lots of Chinese, Indian, Russian, and Europeans are involved.

As for what comes next, it's probably going to be a bit of a race for who can do the most useful and valuable things the cheapest. If OpenAI and Anthropic don't make it, the technology will survive them. If they do, they'll be competing on quality and cost.

As for state sponsorship, a lot of things are state sponsored. Including in the US. Silicon Valley has a rich history that is rooted in massive government funding programs. There's a great documentary out there the secret history of Silicon Valley on this. Not to mention all the "cheap" gas that is currently powering data centers of course comes on the back of a long history of public funding being channeled into the oil and gas industry.

reply

upvote

by WarmWash18 hours ago|

[-]

>As for state sponsorship, a lot of things are state sponsored.

You can make any comparison you want if you use adjectives rather than values. I can say that cars use a massive amount of water (all those radiators!) to try and downplay agricultural water usage. But its blatantly disingenuous.

SV is overwhelmingly private (actual constitutional private) money. To the point that you should disregard people saying otherwise, just like you would the people saying cars use massive amounts of water.

reply

upvote

by OtomotO18 hours ago|

[-]

So an OPEN model that I can run on my own fucking hardware will entrench China in global supply and logistics how?

Contrary: How will the closed, proprietary models from Anthropic, "Open"AI and Co. lead us all to freedom? Freedom of what exactly? Freedom of my money?

At some point this "anti-communism" bullshit propaganda has to stop. And that moment was decades ago!

reply

upvote

by Zetaphor18 hours ago|

[-]

Anything that isn't explicitly to the benefit of US interests must be against them /s

reply

upvote

by grttsww18 hours ago|

[-]

So what?

I still prefer that over US total dominance.

Let them fight it out.

reply

upvote

by joquarky17 hours ago|

[-]

Yeah, a lot of people are still living within the paradigm of tribalism: my team good, other team bad.

But the events of the past decade or so have clearly demonstrated that there are no "good" actors.

I personally couldn't care less who wins in the China vs US AI competition, both sides have a long list of pros and cons.

reply

upvote

by spwa418 hours ago|

[-]

I'd get a bit informed about what exactly Chinese dominance entails. Ask a few Uyghurs, Cantonese Hong Kongers, or even Tibetans.

Then decide ...

reply

upvote

by joquarky17 hours ago|

[-]

Ask a few Native Americans about dominance.

Or maybe families of African descent.

Or maybe families of Japanese Americans who lived in the US during WWII.

Or maybe people of Latin descent living in the US today.

reply

upvote

by jazz9k17 hours ago|

[-]

The US examples you just gave happened decades (and in some cases hundreds) of years ago. The difference is that it's happening in China right now, and nobody cares.

You really don't see the difference?

reply

upvote

by well_ackshually16 hours ago|

[-]

The US is the biggest threat to the world right now, and is actively supporting a genocide in Palestine as well as war crimes in Lebanon.

I'm perfectly happy to let the chinese get a piece of the pie and fight the US, no matter how bad they are right now.

reply

upvote

by grttsww7 hours ago|

[-]

What a delusional dumb ass you are

reply

upvote

by darkwater18 hours ago|

[-]

Well, isn't this what the US and really any other power in the world has always done, since forever?

reply

upvote

by ai_fry_ur_brain17 hours ago|

[-]

Why is it sad? These things are useles all around, along with the people who overuse them.

It would be a great day for humanity if people would stopping glazing text autocomplete as revolutionary.

reply

upvote

by 0xbadcafebee19 hours ago|

[-]

Everybody's out here chasing SOTA, meanwhile I'm getting all my coding done with MiniMax M2.5 in multiple parallel sessions for $10/month and never running into limits.

reply

upvote

by Aurornis19 hours ago|

[-]

For serious work, the difference between spending $10/month and $100/month is not even worth considering for most professional developers. There are exceptions like students and people in very low income countries, but I’m always confused by developers with in careers where six figure salaries are normal who are going cheap on tools.

I find even the SOTA models to be far away from trustworthy for anything beyond throwaway tasks. Supervising a less-than-SOTA model to save $10 to $100 per month is not attractive to me in the least.

I have been experimenting with self hosted models for smaller throwaway tasks a lot. It’s fun, but I’m not going to waste my time with it for the real work.

reply

upvote

by zozbot23419 hours ago|

[-]

You need to supervise the model anyway, because you want that code to be long-term maintainable and defect free, and AI is nowhere near strong enough to guarantee that anytime soon. Using the latest Opus for literally everything is just a huge waste of effort.

reply

upvote

by senordevnyc16 hours ago|

[-]

Yes, but I find supervision much easier and faster with a strong model. It makes fewer dumb mistakes that I have to catch and correct, and it’ll follow my instructions more reliably.

reply

upvote

by jatins5 hours ago|

[-]

Depends on the task. If it's something that occurs a lot in training data like React/tailwind code then I don't think you need SOTA. Most reasoning models since Sonnet 3.5, Deepseek 3.1 et al will do fine for those tasks.

Doesn't justify 10x the cost in that case imo

reply

upvote

by dandaka18 hours ago|

[-]

Waste of effort... of Opus? If "Opus effort" is cheaper, than dev hours managing yourself more dumb/effective model, what is the point?

reply

upvote

by cyanydeez18 hours ago|

[-]

rich people dont concern themselves with the cost of tokens.

reply

upvote

by dnnddidiej13 hours ago|

[-]

It is not even rich. If you earn more than $30k it is worth your employer spending $3k on AI tools.

reply

upvote

by 0xbadcafebee11 hours ago|

[-]

You don't magically get better results by spending 10x more on a model. If your prompt is crap and harness is crap, you get crap results, regardless of model. And if you run into limits, you aren't working at all.

Buying the most expensive circular saw doesn't get you the best woodworking, but it is the most expensive woodworking.

reply

upvote

by itake9 hours ago|

[-]

Not really true. Remember the prompt engineering craze a few years ago with crazy complex prompt composers (langchain) that don’t need to exist any more because the underlying model got so much better at understanding what the humans are actually asking for?

reply

upvote

by 0xbadcafebee3 hours ago|

[-]

A model cannot read your mind. It can guess, and those guesses are more likely to be wrong if you don't give it the right input, and model performance gets worse if not steered/curated properly. The output depends on the input.

https://medium.com/@adambaitch/the-model-vs-the-harness-whic... | https://aakashgupta.medium.com/2025-was-agents-2026-is-agent... | https://x.com/Hxlfed14/status/2028116431876116660 | https://www.langchain.com/blog/the-anatomy-of-an-agent-harne...

(I don't think anecdotes are useful in these comparisons, but I'll throw mine in anyway: I use GPT-5.4, GPT-5.3-Codex, Gemini-3-Pro, Opus, Sonnet, at work every week. I then switch to GLM-5.1, K2-Thinking. Other than how chatty they get, and how they handle planning, I get the same results. Sometimes they're great, sometimes I spent an hour trying to coax them towards the solution I want. The more time I spend describing the problem and solution and feeding them data, the better the results, regardless of model. The biggest problem I run into lately is every website in the world is blocking WebFetch so I have to manually download docs, which sucks. And for 90% of my coding and system work, I see no difference between M2.5 and SOTA models, because there's only so much better you can get at writing a simple script or function or navigating a shell. This is why Anthropic themselves have always told people to use Sonnet to orchestrate complex work, and Haiku for subagents. But of course they want you to pay for Opus, because they want your money.)

reply

upvote

by slopinthebag14 hours ago|

[-]

$100 / month will get you rate limited to much to rely on with the Claude plans. People still report getting rate limited on the $200 / plan.

Also not everyone wants to use Claude Code, so if they're paying API pricing it's more likely thousands of dollars a month. If you can get the same results by spending a fraction of that, why wouldn't you?

reply

upvote

by chillfox53 minutes ago|

[-]

Managing context size and efficient token usage is a skill.

I have an Anthropic API key for work, and if I use sonnet/opus all day for agent coding, it ends up costing about ~$25.

I am going to need more cpu/ram to run multiple agents in parallel to spend much more than that.

reply

upvote

by esperent8 hours ago|

[-]

I got rate limited within an hour on the $200 while working on a single feature.

That was the breaking point, I cancelled my subscription.

As it happens I had a low coding workload over the past two weeks so I've been noodling around in PI mostly with Gemini Flash api. I like it - I even agree it's a much better harness than CC. However, the lock in is real. Even without switching models which each have their own quirks, I expect my work speed to drop drastically for at least a week or two even if I was focused on it fully. But after the learning period I think pi will be faster. The danger of course is that CC is fairly on rails while with PI you could end up spending all your time tinkering with the harness.

reply

upvote

by gck112 hours ago|

[-]

And people report getting limited on the $200 plan is putting it very mildly.

You can't do any serious work on it without rationing your work and kneecapping your workflows, to the point where you design workflows around anthropic usage limit woodoo rather than what actually works.

Without this, I run into WEEKLY usage limits on $200 plan, working on a single codebase, one feature at a time, on just day 3.

reply

upvote

by slopinthebag10 hours ago|

[-]

Thats crazy to pin your entire workflow on. Sorry boss, I can't work today I'm being rate limited by Anthropic :/

reply

upvote

by AnonymousPlanet18 hours ago|

[-]

For actually serious work, it's a stark difference if your proprietary and security relevant code is sent abroad to a foreign, possibly future hostile country, or is sent to some data center around the corner. It doesn't even need to be defence related.

reply

upvote

by flatline17 hours ago|

[-]

AFAIK all these companies have SOTA or near-SOTA models available under enterprise licenses. AI companies are not interested in your secret sauce, they are trying to capture the SDLC wholesale.

reply

upvote

by hedora14 hours ago|

[-]

I’m not sure what you are implying by “enterprise license”, but if you think it provides any meaningful protection against malicious US government actors, you really need to read and internalize the US CLOUD Act.

On a related note, I really need to try some local models (probably starting with qwen), since, at least in 2026, the Chinese models are way better at protecting democracy and free speech than the US models.

reply

upvote

by AnonymousPlanet17 hours ago|

[-]

If an American company, let's say a company that writes software for power stations, would use the services of a French or Chinese AI company under such enterprise licenses, how long would you think it would take until someone, in Congress e.g., would interfere?

What if they learned that half of the American small and medium sized companies would have started pouring all their business information into such a service?

reply

upvote

by dnnddidiej13 hours ago|

[-]

That doesn't address the concern. Google isn't interested in violating 1st and 4th amendment rights of people who criticize the government... but they do anyway (or more correctly assist the government in doing so).

reply

upvote

by chatmasta17 hours ago|

[-]

Who are you paying $10/month? OpenRouter?

reply

upvote

by 0xbadcafebee11 hours ago|

[-]

OpenCode Go, BlackBox, Chutes. https://codeberg.org/mutablecc/calculate-ai-cost/src/branch/...

reply

upvote

by chatmasta10 hours ago|

[-]

I find Chutes very intriguing… has anyone used it? I found it when I started wondering what sort of $/performance I could get by simply renting GPU machines by the hour and running my own inference.

reply

upvote

by tgrowazay16 hours ago|

[-]

https://platform.minimax.io/docs/guides/pricing-token-plan

reply

upvote

by xutopia16 hours ago|

[-]

How do you use this? Do you use opencode or another frontend?

reply

upvote

by 0xbadcafebee10 hours ago|

[-]

yep, OpenCode with a few plugins (context management, memory, a few MCPs)

reply

upvote

by fnetisma18 hours ago|

[-]

[dead]

reply

upvote

by jjice20 hours ago|

[-]

With them comparing to Opus 4.5, I find it hard to take some of these in good faith. Opus 4.7 is new, so I don't expect that, but Opus 4.6 has been out for quite some time.

reply

upvote

by SwellJoe18 hours ago|

[-]

The thing is, Opus 4.5 is where the model reached Good Enough, at least for a wide variety of problems I use LLMs for. Before that, I almost never thought it was a more productive use of my time to use AI for development tasks, because it would always hallucinate something that would waste a bunch of my time. It just wasn't a good trade.

But, if for some reason everything stopped at Opus 4.5 level and we never got a better model (and 4.6/4.7 are better, if only marginally so and mostly expanding the kind of work it can do rather than making it better at making web apps), we could still do a lot of real work real fast with Opus 4.5, and software development would never go back to everyone handwriting most of the code.

A model as good as Opus 4.5 (or slightly better according to the mostly easily gamed benchmarks) at a 10th the price is probably a worthwhile proposition for a lot of people. $100 a month, or more, to get Opus 4.7 is well worth it for a western developer...the time the lower-end models waste is far more expensive than the cost of using the most expensive models. For the foreseeable future, I'll keep paying a premium for the models that waste less of my time and produce better results with less prodding.

But, also, it's wild how fast things move. Open models you can run on relatively modest hardware are competitive with frontier models of two years ago. I mean, you can run Qwen 3.6 MoE 35B A3B or the larger Gemma 4 models on normal hardware, like a beefy Macbook or a Strix Halo or any recentish 24GB/32GB GPU...not much more expensive than the average developer laptop of pre-AI times. And, it can write code. It can write decent prose (Qwen is maybe better at code, Gemma definitely has better prose), they can use tools, they have a big enough context window for real work. They aren't as good as Opus 4.5, yet.

Anyway, I use several models at this point, for security and code reviews, even if Claude Code with Opus is still obviously the best option for most software development tasks. I'll give Qwen a try, too. I like their small models, which punch well above their weight, I'll probably like the big one, too.

reply

upvote

by Someone123420 hours ago|

[-]

If money is no object, then nothing else is worth considering if it isn't Codex 5.4/Opus 4.7/SOTA. But for many to most people, value Vs. relative quality are huge levers.

Even many people on a Claude subscription aren't choosing or able to choose Opus 4.7 because of those cost/usage pressures. Often using Sonnet or an older opus, because of the value Vs. quality curve.

reply

upvote

by dd8601fn19 hours ago|

[-]

Also us weirdos with local model uses. But your point stands.

reply

upvote

by seplite19 hours ago|

[-]

Unfortunately, like with the release of Qwen3.6-Plus, this model also isn’t released for local use. From the linked article: “Qwen3.6-Max-Preview is the hosted proprietary model available via Alibaba Cloud Model Studio”

reply

upvote

by zozbot23419 hours ago|

[-]

The Max series was never available for local use, though. So this is expected.

reply

upvote

by dd8601fn15 hours ago|

[-]

Sure, not plus or max. I just use their lesser moe ones locally (that would never come close to massive sota models) all the time.

reply

upvote

by CamperBob219 hours ago|

[-]

Cost may or may not be a factor in my choice of model, but knowing the capabilities and knowing they will remain consistent, reliable, and available over time is always a dominant consideration. Lately, Anthropic in particular has not been great at that.

reply

upvote

by jpfromlondon18 hours ago|

[-]

anecdotally the quality of output isn't significantly different, the speed seems to be what you're really paying for, and since the alternative is free I'll stick to local.

reply

upvote

by paprikanotfound17 hours ago|

[-]

What are the best models to run locally?

reply

upvote

by jpfromlondon2 hours ago|

[-]

right now Gemma 4 and Qwen 3.6, I've found the latter to have the slight edge but your results may vary.

reply

upvote

by elAhmo18 hours ago|

[-]

Codex 5.4 is not out?

reply

upvote

by wahnfrieden20 hours ago|

[-]

Codex subscription is very generous at pro tiers

reply

upvote

by oidar19 hours ago|

[-]

Opus 4.6 performance has been so wildly inconsistent over the past couple of months, why waste the tokens?

reply

upvote

by vidarh19 hours ago|

[-]

When Sonnet 4.6 was released, I switchmed my default from Opus to Sonnet because it was about en par with Opus 4.5. While 4.6 and 4.7 are "better", the leap is too small for most tasks for me to need it, and so reducing cost is now a valid reason to stay at that level.

If even cheaper models start reaching that level (GLM 5.1 is also close enough that I'm using it at lot), that's a big deal, and a totally valid reason to compare against Opus 4.5

reply

upvote

by jasonjmcghee18 hours ago|

[-]

Wow I couldn't disagree more.

For me, Opus 4.5 and 4.6 feel so different compared to sonnet.

Maybe I'm lazy or something but sonnet is much worse in my experience at inferring intent correctly if I've left any ambiguity.

That effect is super compounding.

reply

upvote

by hirako200020 hours ago|

[-]

You compare with what's most comparable.

In any case a benchmark provided by the provider is always biased, they will pick the frameworks where their model fares well. Omit the others.

Independent benchmarks are the go to.

reply

upvote

by 19 hours ago|

[-]

deleted

reply

upvote

by culi18 hours ago|

[-]

Opus 4.6 was released in February. It can take quite some time to run all these benchmarks properly

reply

upvote

by alex_young19 hours ago|

[-]

Quite some time is a little over 2 months. I understand this is actually true right now, but it’s still a bit hard to accept.

reply

upvote

by cute_boi18 hours ago|

[-]

Comparing it with Opus 4.6 is difficult, since Anthropic may ban accounts and accuse users of state-sponsored hacking.

reply

upvote

by 19 hours ago|

[-]

deleted

reply

upvote

by bluegatty19 hours ago|

[-]

I think its only been like 10 weeks. I meant that's forever in AI time, but not a long time in normie people time.

reply

upvote

by jdw6417 hours ago|

[-]

https://www.alibabacloud.com/help/en/model-studio/context-ca... I’ve also been testing models like Opus, Codex, and Qwen, and Qwen is strong in many coding tasks. However, my main concern is how it behaves in long-running sessions.

While Qwen advertises large context windows, in practice the effectiveness of long-context usage seems to depend heavily on its context caching behavior. According to the official documentation, Qwen provides both implicit and explicit context caching, but these come with constraints such as short TTL (around a few minutes), prefix-based matching, and minimum token thresholds.

Because of these constraints, especially in workflows like coding agents where context grows over time, cache reuse may not scale as effectively as expected. As a result, even though the per-token price looks low, the effective cost in long sessions can feel higher due to reduced cache hit rates and repeated computation.

That said, in certain areas such as security-related tasks, I’ve personally had cases where Qwen performed better than Opus.

In my personal experience, Qwen tends to perform much better than Opus on shorter units like individual methods or functions. However, when looking at the overall coding experience, I found it works better as a function-level generator rather than as an autonomous, end-to-end coding assistant like Claude.

reply

upvote

by ezekiel6815 hours ago|

[-]

TBF, it's certainly best practice, advised by the model providers themselves, to cut sessions short and start new ones.

Anthropic's "Best Practices" doc[0] for Claude Code states, "A clean session with a better prompt almost always outperforms a long session with accumulated corrections."

[0] https://code.claude.com/docs/en/best-practices

reply

upvote

by hedora14 hours ago|

[-]

Unless stuff changed since I last checked, context caching just reduces cost / latency. It does not change what tokens are emitted.

reply

upvote

by greyskull10 hours ago|

[-]

I've been using Claude Code regularly at work for several months, and I successfully used it for a small personal project (a website) not long ago. Last weekend, I explored self-hosting for the first time.

Does anyone have a similar experience of having thoroughly used CC/Codex/whatever and also have an analogous self-hosted setup that they're somewhat happy with? I'm struggling a bit.

I have 32GB of DDR5 (seems inadequate nowadays), an AMD 7800X3D, and an RTX 4090. I'm using Windows but I have WSL enabled.

I tried a few combinations of ollama, docker desktop model runner, pi-coding-agent and opencode; and for models, I think I tried a few variants each of Gemma 4, Qwen, GLM-5.1. My "baseline" RAM usage was so high from the handful of regular applications that IIRC it wasn't enough to use the best models; e.g., I couldn't run Gemma4-31B.

Things work okay in a Windows-only setup, though the agent struggled to get file paths correct. I did have some success running pi/opencode in WSL and running ollama and the model via docker desktop.

In terms of actual performance, it was painfully slow compared to the throughput I'm used to from CC, and the tooling didn't feel as good as the CC harness. Admittedly I didn't spend long enough actually using it after fiddling with setup for so long, it was at least a fun experiment.

reply

upvote

by ihowlatthemoon6 hours ago|

[-]

I run a setup similar to yours and I've had the best results with Qwen3.5 27B. Specifically the Q4_K_M variant. https://unsloth.ai/docs/models/qwen3.5

I use llama-server that comes with llama.cpp instead of using ollama. Here are the exact settings I use.

llama-server -ngl 99 -c 192072 -fa on --cache-type-k q4_0 --cache-type-v q4_0 --host 0.0.0.0 --sleep-idle-seconds 300 -m Qwen3.5-27B-Q4_K_M.gguf

reply

upvote

by greyskull5 hours ago|

[-]

Thanks, I'll have to continue experimenting. I just ran this model Qwen3.6-35B-A3B-GGUF:UD-Q4_K_XL and it works, but if gemini is to be believed this is saturating too much VRAM to use for chat context.

How did you land on that model? Hard to tell if I should be a) going to 3.5, b) going to fewer parameters, c) going to a different quantization/variant.

I didn't consider those other flags either, cool.

Are you having good luck with any particular harnesses or other tooling?

reply

upvote

by ihowlatthemoon4 minutes ago|

[-]

35B-A3B means it's a MoE model with 35B total parameters but with only 3B active at once. The one I use is the 27B dense model. Usually, dense models give better responses, but are slower than the MoE. With your 4090, you should be able to get about 50 tok/s with the dense model, which is more than enough for practical use.

If you want to keep using the same model, these settings worked for me.

llama-server -ngl 99 -c 262144 -fa on --cache-type-k q4_0 --cache-type-v q4_0 --host 0.0.0.0 --sleep-idle-seconds 300 -m Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf

For the harness, I use pi (https://pi.dev/). And sometimes, I use the Roo Code plugin for VS Code. (https://roocode.com/)

I prefer simplicity in my tooling, so I can understand them easier. But you might have better luck with other harnesses.

reply

upvote

by martinald10 hours ago|

[-]

Try using a MoE model (like Gemma 4 26b-a4b or qwen3.6 35b-a3b) and offload the inference to CPU. If you have enough system RAM (32GB is a bit tight tbh depending on other apps) then this works really well. You may be able to offload some layers to GPU as well though I've had issues with this in MoE models and llama.cpp.

You can keep the KV cache on GPU which means it's pretty damn fast and you should be able to hold a reasonable context window size (on your GPU).

I've had really impressive results locally with this.

I'd strongly recommend cloning llama.cpp locally btw (in wsl2) and asking a frontier model in eg Claude code to set it up for you and tweak it. In my experience the apps that sit on top of llama.cpp don't expose all the options and flags and one wrong flag can mean terrible performance (eg context windows not being cached). If you compile it from source with a coding agent it can look up the actual code when things go wrong.

You should be able to get at least 20-40tok/s on that machine on Gemma 4 which is very usable, probabaly faster on qwen3.6 since it's only 3b active params.

reply

upvote

by Ey7NFZ3P0nzAe4 hours ago|

[-]

In my case, I was also running an ASR model and a TTS model so it was a bit much for my RTX 3090. I opted to offset like 5 layers to the cpu while adding a GPU-only speculative decoding with their 0.8B model.

Working well so far.

reply

upvote

by greyskull9 hours ago|

[-]

Thanks! These things you're mentioning like "You may be able to offload some layers to GPU...", "You can keep the KV cache on GPU..." configured as part of the llama.cpp? I wouldn't know what to prompt with or how to evaluate "correctness" (outside of literally feeding your comment into claude and seeing what happens).

Aside: what is your tooling setup? Which harness you're using (if any), what's running the inference and where, what runs in WSL vs Windows, etc.

I struggle to even ask the right questions about the workflow and environment.

reply

upvote

by madtowneast10 hours ago|

[-]

You are experiencing the fact that you might not have enough VRAM to load the entire model at a time. You might want to try https://github.com/AlexsJones/llmfit

reply

upvote

by greyskull9 hours ago|

[-]

It's certainly part of the problem. Thanks, I'll give that a shot.

reply

upvote

by daemonologist9 hours ago|

[-]

First of all nothing you can run locally, on that machine anyways, is going to compare with Opus. (Or even recent Sonnet tbh - some small models benchmark better but fall off a bit in the real world.) This will get you close to like ~Sonnet 4 though:

Grab a recent win-vulkan-x64 build of llama.cpp here: https://github.com/ggml-org/llama.cpp/releases - llama.cpp is the engine used by Ollama and common wisdom is to just use it directly. You can try CUDA as well for a speedup but in my experience Vulkan is most likely to "just work" and is not too far behind in speed.

For best quality, download the biggest version of Qwen 3.5 27B you can fit on your 4090 while still leaving room for context and overhead: https://huggingface.co/unsloth/Qwen3.5-27B-GGUF - I would try the UD-Q5_K_XL but you might have to drop down to Q5_K_S. For best speed, you could use Qwen 3.6 35B-A3B (bigger model but fewer parameters are active per token): https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF - probably the UD-Q4_K_S for this one.

Now you need to make sure the whole model is fitting in VRAM on the 4090 - if anything gets offloaded to system memory it's going to slow way down. You'll want to read the docs here: https://github.com/ggml-org/llama.cpp/tree/master/tools/serv... (and probably random github issues and posts on r/localllama as well), but to get started:

  llama-server -m /path/to/above/model/here.gguf --no-mmap --fit on --fit-ctx 20000 --parallel 1

This will spit out a whole bunch of info; for now we want to look just above the dotted line for "load_tensors: offloading n/n layers to GPU" - if fewer than 100% of the layers are on GPU, inference is going to be slower and you probably want to drop down to a smaller version of the model. The "dense" 27B will be slowed more by this than the "mixture-of-experts" 35B-A3B, which has to move fewer weights per token from memory to the GPU.

Go to the printed link (localhost:8080 by default) and check that the model seems to be working normally in the default chat interface. Then, you're going to want more context space than 20k tokens, so look at your available VRAM (I think the regular Windows task manager resource monitor will show this) and incrementally increase the fit-ctx target until it's almost full. 100k context is enough for basic coding, but more like 200k would be better. Qwen's max native context length is 262,144. If you want to push this to the limit you can use `--fit-target <amount of memory in MB>` to reduce the free VRAM target to less than the default 1024 - this may slow down the rest of your system though.

Finally, start hooking up coding harnesses (llama-server is providing an OpenAI-compatible API at localhost:8080/v1/ with no password/token). Opencode seems to work pretty reliably, although there's been some controversy about telemetry and such. Zed has a nice GUI but Qwen sometimes has trouble with its tools. Frankly I haven't found an open harness I'm really happy with.

reply

upvote

by greyskull7 hours ago|

[-]

Thank you for all this, I'll give it a shot. Out of curiosity, are there any resources that sort of spell this out already? i.e., not requiring a comment like this to navigate.

> nothing you can run locally, on that machine anyways, is going to compare with Opus

Definitely not expecting that. Just wanted to find a setup that individuals were content with using a coding harness and a model that is usable locally.

What does your setup look like? Model, harness, etc.

reply

upvote

by unethical_ban5 hours ago|

[-]

This is exactly what I have been looking for: Something straight to the point. Thanks a lot!

reply

upvote

by fr3on14 hours ago|

[-]

The irony of this announcement is in the name: Max-Preview is proprietary, cloud-only. The Qwen models that actually matter — the ones running on real hardware people own — are the open weights series. I run the 32B and 72B variants locally on dual A4000s. The gap between those and the hosted Max is real, but it's shrinking with every release. The interesting question isn't how Max compares to Opus. It's how long until the open-weight tier makes the cloud tier irrelevant for most workloads.

reply

upvote

by sva_14 hours ago|

[-]

[flagged]

reply

upvote

by trvz20 hours ago|

[-]

The fun thing is, you can be aware of the entire range of Qwen models that are available for local running, but not at all about their cloud models.

I knew of all the 3.5’s and the one 3.6, but only now heard about the Plus.

reply

upvote

by Alifatisk19 hours ago|

[-]

Their Plus series have existed since Qwen chat was available , as far as I remember. I can at least remember trying out their Plus model early last year.

reply

upvote

by wg018 hours ago|

[-]

Notice the pattern that Chinese providers are now:

1. Keeping models closed source.

2. Jacking up pricing. A lot. Sometimes up to 100% increase.

reply

upvote

by embedding-shape18 hours ago|

[-]

Huh yeah, that's truly a unique trait these Chinese companies don't share with companies in other countries.

reply

upvote

by aerhardt16 hours ago|

[-]

No it is not, but they had a unique positioning around open-source and the parent commenter means that they are losing it.

reply

upvote

by esperent8 hours ago|

[-]

Again, a trait they share with companies in other countries. It's the obvious business model: get known by releasing impressive open models, then pivot to closed for even more impressive models.

That's going to be the path for every new company from every country, I assume. They are not releasing open models out of the goodness of their hearts. They are for-profit companies, they don't have hearts, they just have balance sheets.

reply

upvote

by halJordan11 hours ago|

[-]

Qwen max has always been cloud only. And its a 1T+ model so it would be expensive

reply

upvote

by nicce16 hours ago|

[-]

> Jacking up pricing. A lot. Sometimes up to 100% increase.

How is that different from American?

reply

upvote

by Tepix17 hours ago|

[-]

Are you talking about GLM 5.1, DeepSeek V3.2 or Kimi K2.6 (released one hour ago!)?

Oh wait, it doesn't apply to those…

reply

upvote

by Kerrick16 hours ago|

[-]

Z.ai's Coding Plan with GLM 5.1 (Max) did more than double in price. It was $80 two weeks ago, and now it's $160.

reply

upvote

by 15 hours ago|

[-]

deleted

reply

upvote

by slopinthebag15 hours ago|

[-]

Coding plans are subsidised crap anyways, the real price win is the API pricing which is not.

reply

upvote

by dingocat16 hours ago|

[-]

Yet.

reply

upvote

by OtomotO18 hours ago|

[-]

US companies hate that trick?!

reply

upvote

by rc_kas17 hours ago|

[-]

you mean: invented

reply

upvote

by sunaookami15 hours ago|

[-]

Yeah Claude Haiku (don't remember the version) did it first, they claimed it was because "it's smarter now" (it's still dumb). Then OpenAI did it with GPT-5 and Google did the same with Gemini Flash and now every new model version is at least twice as expensive than the one before that.

reply

upvote

by cnlwsu17 hours ago|

[-]

what only Oracle can do it?

reply

upvote

by cute_boi18 hours ago|

[-]

Well, they can't subsidize forever. And, it is kinda expected?

reply

upvote

by gpm17 hours ago|

[-]

Considering the propaganda value in controlling the inputs to the machine that answers peoples questions, I rather expect them to be subsidized forever.

reply

upvote

by bigyabai16 hours ago|

[-]

Consider the propaganda value of a centrally-controlled apparatus like the iPhone, and then reflect on the 100%+ profit margins that product has enjoyed for the past decade.

reply

upvote

by throwaway61374616 hours ago|

[-]

[dead]

reply

upvote

by ai_fry_ur_brain17 hours ago|

[-]

Yeah, its almost like the casinos started rigging the game after they got all the addicts hooked. Who saw that coming???

If you overuse LLMs or get excited about them at all, you're ngmi and a complete idiot.

reply

upvote

by atilimcetin19 hours ago|

[-]

Nowadays, I'm working on a realtime path tracer where you need proper understanding of microfacet reflection models, PDFs, (multiple) importance sampling, ReSTIR, etc.. Saying that mine is a somewhat specific use case.

And I use Claude, Gemini, GLM, Qwen to double check my math, my code and to get practical information to make my path tracer more efficient. Claude and Gemini failed me more than a couple of times with wrong, misleading and unnecessary information but on the other hand Qwen always gave me proper, practical and correct information. I’ve almost stopped using Claude and Gemini to not to waste my time anymore.

Claude code may shine developing web applications, backends and simple games but it's definitely not for me. And this is the story of my specific use case.

reply

upvote

by wg018 hours ago|

[-]

I have said similar things about someone experiencing similar things while writing some OpenGL code (some raytracing etc) that these models have very little understanding and aren't good at anything beyond basic CRUD web apps.

In my own experience, even with web app of medium scale (think Odoo kind of ERP), they are next to useless in understanding and modling domain correctly with very detailed written specs fed in (whole directory with index.md and sub sections and more detailed sections/chapters in separate markdown files with pointers in index.md) and I am not talking open weight models here - I am talking SOTA Claude Opus 4.6 and Gemini 3.1 Pro etc.

But that narrative isn't popular. I see the parallels here with the Crypto and NFT era. That was surely the future and at least my firm pays me in cypto whereas NFTs are used for rewarding bonusess.

reply

upvote

by wg018 hours ago|

[-]

Someone exactly said it better here[0] already.

[0]. https://news.ycombinator.com/item?id=47817982

reply

upvote

by esperent8 hours ago|

[-]

To be fair, I've had the extreme misfortune of working on Odoo code and I can understand why an LLM would struggle.

Yearly breaking changes but impossible to know what version any example code you find is related to (except that if you're on the latest version, it's definitely not for your version), closed and locked down forum (after several months of being a paying customer, I couldn't even post a reply, let alone ask a question), weird split between open and closed, weird OWL frontend framework that seems to be a bad clone of an old React version, etc. etc. Painful all around. I would call this kind of codebase pre-LLM slop, accreted over many years of bad engineering decisions.

reply

upvote

by amarcheschi18 hours ago|

[-]

a semester ago i was taking a machine learning exam in uni and the exam tasked us with creating a neural network using only numerical libraries (no pytorch ecc). I'm sure that there are a huge lot of examples looking all the same, but given that we were just students without a lot of prior experience we probably deviated from what it had in its training data, with more naive or weird solutions. Asking gemini 3 to refactor things or in very narrow things to help was ok, but it was quite bad at getting the general context, and spotting bugs, so much that a few times it was easier to grab the book and get the original formula right

otoh, we spotted a wrong formula regarding learning rate on wikipedia and it is now correct :) without gemini and just our intuition of "mhh this formula doesn't seem right", that definitely inflated our ego

reply

upvote

by muyuu17 hours ago|

[-]

for Anthropic and OpenAI there is a very real danger that people invest serious time finding the strengths of alternative models, esp Chinese/open models that can to some degree be run locally as well

it puts a massive backstop at the margins they can possibly extract from users

reply

upvote

by zozbot23419 hours ago|

[-]

What size of Qwen is that, though? The largest sizes are admittedly difficult to run locally (though this is an issue of current capability wrt. inference engines, not just raw hardware).

reply

upvote

by atilimcetin19 hours ago|

[-]

I'm directly using https://chat.qwen.ai (Qwen3.6-Plus) and planning to switch to Qwen Code with subscription.

reply

upvote

by jasonjmcghee18 hours ago|

[-]

You may be interested in "radiance cascades"

reply

upvote

by hedora14 hours ago|

[-]

What do you use instead of the Claude code client app?

reply

upvote

by jansan19 hours ago|

[-]

How "social" does Quen feel? The way I am using LLMs for coding makes this actually the most important aspect by now. Claude 4.6 felt like a nice knowledgeable coworker who shared his thinking while solving problems. Claude 4.7 is the difficult anti-social guy who jumps ahead instead of actually answering your questions and does not like to talk to people in general. How are Qwen's social skills?

reply

upvote

by zozbot23418 hours ago|

[-]

Qwen feels like wise Chinese philosopher. Talks in very short elegant sentences, but does very solid work.

reply

upvote

by Alifatisk18 hours ago|

[-]

> Talks in very short elegant sentences

This is not my experience at all, Qwen3.6-Plus spits out multiple paragraphs of text for the prompts I give. It wasn't like this before. Now I have to explicitly tell it not to yap so much and keep it short, concise and direct.

reply

upvote

by johnnyApplePRNG4 hours ago|

[-]

Nowhere near the power of ChatGPT 5.4 Pro imho... thought for maybe 15 seconds on a problem that pro would have spend 15 minutes on... and the results really show :/

reply

upvote

by djyde13 hours ago|

[-]

I've been using glm5.1 for pretty much all my coding work, but Claude is too expensive for me. Haven't tried qwen yet though. China's coding models are now very cost-effective.

reply

upvote

by djyde12 hours ago|

[-]

But I've recently found that Cursor's composer2 is also really good to use.

reply

upvote

by freely008510 hours ago|

[-]

Composer 2 is just Kimi 2.5, it's not their own model.

reply

upvote

by piotraleksander3 hours ago|

[-]

it's such a misinformed statement, as kimi2.5 was used as a base model for composer 2 and then heavily RLed

reply

upvote

by 12 hours ago|

[-]

deleted

reply

upvote

by 12 hours ago|

[-]

deleted

reply

upvote

by 12 hours ago|

[-]

deleted

reply

upvote

by Oras19 hours ago|

[-]

I find it odd that none of OpenAI models was used in comparison, but used Z GLM 5.1. Is Z (GLM 5.1) really that good? It is crushing Opus 4.5 in these benchmarks, if that is true, I would have expected to read many articles on HN on how people flocked CC and Codex to use it.

reply

upvote

by ac2919 hours ago|

[-]

GLM 5.1 is pretty good, probably the best non-US agentic coding model currently available. But both GLM 5.0 and 5.1 have had issues with availability and performance that makes them frustrating to use. Recently GLM 5.1 was also outputting garbage thinking traces for me, but that appears to be fixed now.

reply

upvote

by cmrdporcupine19 hours ago|

[-]

Use them via DeepInfra instead of z.ai. No reliability issues.

https://deepinfra.com/zai-org/GLM-5.1

Looks like fp4 quantization now though? Last week was showing fp8. Hm..

reply

upvote

by wolttam19 hours ago|

[-]

Deepinfra's implementation of it is not correct. Thinking is not preserved, and they're not responding to my submitted issue about it.

I also regularly experience Deepinfra slow to an absolute crawl - I've actually gotten more consistent performance from Z.ai.

I really liked Deepinfra but something doesn't seem right over there at the moment.

reply

upvote

by cmrdporcupine18 hours ago|

[-]

Damn. Yeah, that sucks. I did play with it earlier again and it did seem to slow down.

It's frankly a bummer that there's not seemingly a better serving option for GLM 5.1 than z.AI, who seems to have reliability and cost issues.

reply

upvote

by coder6819 hours ago|

[-]

In fact it is appreciated that Qwen is comparing to a peer. I myself and several eng I know are trying GLM. It's legit. Definitely not the same as Codex or Opus, but cheaper and "good enough". I basically ask GLM to solve a program, walk away 10-15 minutes, and the problem is solved.

reply

upvote

by Oras19 hours ago|

[-]

cheaper is quite subjective, I just went to their pricing page [0] and cost saving compared to performance does not sell it well (again, personal opinion).

CC has a limited capacity for Opus, but fairly good for Sonnet. For Codex, never had issues about hitting my limits and I'm only a pro user.

https://z.ai/subscribe

reply

upvote

by kardianos19 hours ago|

[-]

Yes. GLM 5.1 is that good. I don't think it is as good as Claude was in January or February of this year, but it is similar to how Claude runs now, perhaps better because I feel like it's performance is more consistent.

reply

upvote

by vidarh19 hours ago|

[-]

GLM 5.1 is the first model I've found good enough to spring for a subscription for other than Claude and Codex.

It's not crushing Opus 4.5 in real-life use for me, but it's close enough to be near interchangeable with Sonnet for me for a lot of tasks, though some of the "savings" are eaten up by seemingly using more tokens for similar complexity tasks (I don't have enough data yet, but I've pushed ~500m tokens through it so far.

reply

upvote

by pros19 hours ago|

[-]

I'm using GLM 5.1 for the last two weeks as a cheaper alternative to Sonnet, and it's great - probably somewhere between Sonnet and Opus. It's pretty slow though.

reply

upvote

by bensyverson17 hours ago|

[-]

This is what kills it for me… The long thinking blocks can make a simple task take 30 minutes.

reply

upvote

by Alifatisk19 hours ago|

[-]

GLM-5 is good, like really good. Especially if you take pricing into consideration. I paid 7$ for 3 months. And I get more usage than CC.

They have difficulty supplying their users with capacity, but in an email they pointed out that they are aware of it. During peak hours, I experience degraded performance. But I am on their lowest tier subscription, so I understand if my demand is not prioritized during those hours.

reply

upvote

by ekuck18 hours ago|

[-]

Where are you getting 3 months for $7?

reply

upvote

by Alifatisk17 hours ago|

[-]

They had a Christmas deal that ended January 31.

reply

upvote

by culi18 hours ago|

[-]

If you only look at open models, GLM 5.1 is the best performance you can get on on the Pareto distribution

https://arena.ai/leaderboard/text?viewBy=plot&license=open-s...

reply

upvote

by c0n5pir4cy19 hours ago|

[-]

I've been using it through OpenCode Go and it does seem decent in my limited experience. I haven't done anything which I could directly compare to Opus yet though.

I did give it one task which was more complex and I was quite impressed by. I had a local setup with Tiltdev, K3S and a pnpm monorepo which was failing to run the web application dev server; GLM correctly figured out that it was a container image build cache issue after inspecting the containers etc and corrected the Tiltfile and build setup.

reply

upvote

by cleaning19 hours ago|

[-]

Most HN commenters seem to be a step behind the latest developments, and sometimes miss them entirely (Kimi K2.5 is one example). Not surprising as most people don't want to put in the effort to sift through the bullshit on Twitter to figure out the latest opinions. Many people here will still prefer the output of Opus 4.5/4.6/4.7, nowadays this mostly comes down to the aesthetic choices Anthropic has made.

reply

upvote

by Oras19 hours ago|

[-]

Not just aesthetics though, from time to time I implement the same feature with CC and Codex just to compare results, and I yet to find Codex making better decisions or even the completeness of the feature.

For more complicated stuff, like queries or data comparison, Codex seems always behind for me.

reply

upvote

by throwaw1219 hours ago|

[-]

maybe they decided OpenAI has different market, hence comparing only with companies who are focusing in dev tooling: Claude, GLM

reply

upvote

by edwinjm19 hours ago|

[-]

Haven’t you heard about Codex?

reply

upvote

by throwaw1219 hours ago|

[-]

its an SKU from OpenAI's perspective, broader goal and vision is (was) different. Look at the Claude and GLM, both were 95% committed to dev tooling: best coding models, coding harness, even their cowork is built on top of claude code

reply

upvote

by zozbot23419 hours ago|

[-]

I'm not sure how this makes sense when Claude models aren't even coding specific: Haiku, Sonnet, Opus are the exact same models you'd use for chat or (with the recent Mythos) bleeding edge research.

reply

upvote

by throwaw1219 hours ago|

[-]

Anthropic models and training data is optimized for coding use cases, this is the difference.

OpenAI on the other hand has different models optimized for coding, GPT-x-codex, Anthropic doesnt have this distinction

reply

upvote

by pixel_popping18 hours ago|

[-]

But they detect it under the hood and apply a similar "variant", as API results are not the same than on Claude Code (that was documented before by someone).

reply

upvote

by __blockcipher__19 hours ago|

[-]

Yeah GLM’s great for coding, code review, and tool use. Not amazing at other domains.

reply

upvote

by esafak19 hours ago|

[-]

I use it and think its intelligence compares favorably with OpenAI and Anthropic workhorses. Its biggest weakness is its speed.

reply

upvote

by XCSme17 hours ago|

[-]

A bit weird to be comparing it to Opus-4.5 when 4.7 was released...

reply

upvote

by chatmasta17 hours ago|

[-]

Is this going to be an open weights model or not? The post doesn’t make it clear. It seems the weights are not available today, but maybe that’s because it’s in preview?

reply

upvote

by zozbot23417 hours ago|

[-]

The Max series has never been open.

reply

upvote

by digimantis9 hours ago|

[-]

i dont get why people defend $200/month models against open source model that cost 1/10 of the price, like literally

reply

upvote

by marsulta18 hours ago|

[-]

I think the benchmarks and numbers need to be easier to read. Those benchmarks are useless to the regular consumer.

reply

upvote

by o1044936616 hours ago|

[-]

I have the M3 Max MBP with 128 GB of memory and the 40 core GPU. What's the best local model I can run today for coding?

reply

upvote

by alx-ppv15 hours ago|

[-]

You can try https://github.com/AlexsJones/llmfit

reply

upvote

by fragmede5 hours ago|

[-]

This thing on Celebras is going to be ridiculous.

reply

upvote

by Aeroi14 hours ago|

[-]

why do people continue to benchmark their sota models against older models.

reply

upvote

by xmly16 hours ago|

[-]

Very impressive!

reply

upvote

by DeathArrow19 hours ago|

[-]

I am trying since one week to subscribe Alibaba Coding Plan (to use Qwen 3.6 Plus) but it's always out of stock.

They brag about Qwen but don't let people use it.

reply

upvote

by 19 hours ago|

[-]

deleted

reply

upvote

by alanmercer6 hours ago|

[-]

[dead]

reply

upvote

by EthanFrostHI7 hours ago|

[-]

[dead]

reply

upvote

by JLO6419 hours ago|

[-]

[dead]

reply

upvote

by mockbolt16 hours ago|

[-]

[dead]

reply

upvote

by bauratynov4 hours ago|

[-]

[dead]

reply

upvote

by souravroyetl17 hours ago|

[-]

[flagged]

reply

upvote

by dakolli16 hours ago|

[-]

ToKeN PrIcEs ArE gOiNg tO PluMmEt, InTelLigEnCe WiLl Be AfForDaBlE FoR EvErYOnE

reply