Contrary to the model card, its one-shot performance is more impressive than its agentic abilities. On both metrics, GLM 5.1 is competitive with frontier models.
But keeping in mind this is an open source model operating near the frontier, it's nothing short of incredible.
I suspect two issues are keeping the model from fully realizing its potential in agentic harnesses:

- Context rot (already a common complaint). We are still working on a metric to robustly test and visualize this on the site.
- The model was most likely overtrained on standardized toolsets and benchmarks, and isn't as adaptive at using arbitrary tooling in our custom harness simulations.

We've decided to commit to measuring intelligence as the ability to use custom, changing tools, rather than as proficiency with the specific tools a model was trained on (while still always providing a way to run local bash and other common tools). There are arguments for either approach, but the former is more indicative of general intelligence. Regardless, it's a subtle difference, and GLM 5.1 still performs well with tooling in our environments.
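To make the "custom, changing tools" idea concrete, here's a minimal hypothetical sketch of a harness-side tool registry where tool names and schemas can be swapped between runs, so the model has to read each tool's description rather than rely on a memorized toolset. All names and tools here are invented for illustration.

```python
import json

def make_tool(name, description, fn):
    """Bundle a callable with the name/description the model sees."""
    return {"name": name, "description": description, "fn": fn}

# Two tools that could be renamed or reshaped between runs.
tools = {
    t["name"]: t
    for t in [
        make_tool("grep_notes", "Search the notes store for a substring.",
                  lambda q, store=("alpha", "beta", "gamma"): [s for s in store if q in s]),
        make_tool("sum_costs", "Add a list of per-item costs.",
                  lambda costs: round(sum(costs), 2)),
    ]
}

def dispatch(tool_call_json):
    """Route a model-emitted tool call like {"tool": ..., "args": {...}}."""
    call = json.loads(tool_call_json)
    return tools[call["tool"]]["fn"](**call["args"])

print(dispatch('{"tool": "sum_costs", "args": {"costs": [1.4, 4.4, 0.26]}}'))  # 6.06
```

The point of a setup like this is that nothing about `grep_notes` or `sum_costs` could have been memorized during training; the model only succeeds if it actually reads the schema it is handed.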
Crazy week for open source AI. Gemma 4 has shown that large model density is nowhere near optimized. Moats are shrinking.
If there are more representations of model performance you'd like to see, I'm actively reading your feedback and ideas.
My impression is that the choice of harness matters a lot.
I've been testing it for a while now, since it seemed to have potential as a local model.
With this new update it still cannot parse simple test PDFs correctly. It inconsistently tells me that the value in the name field of the document is incorrect, or reverses the name to put the last name first. Or it claims a date is wrong because it's in the past/future, when it is not. Tons of fundamental errors like that.
Even when looking at the thinking process there are issues:
I used a test website for it to analyze and it says that the sites copyright year states 2026 which is in the future and to investigate as it could be an attack, but right after prints today's correct date.
I'm in the process of trying to get it uncensored. Hopefully that will create some use out of z.ai
Edit: by the way, which is the best uncensored model at the moment?
I also use Claude premium daily for another client, and I use Codex, and I can tell you that GLM5 is at this point much more capable than Claude and Codex for complex backend work, complex feature planning, and long-horizon tasks. One thing I've noticed is that it is particularly good at following instructions and guidelines, even deep into the execution of a plan.
To me the only problem is that z.ai has had trouble with inference: the performance of their API has been pretty poor at times. It looks like this is a hardware issue related to the Huawei chips they use, rather than an issue with the model itself. The situation has been improving substantially over the past few weeks.
GLM5.1, GLM5-Turbo and GLM5v are at this point better than Opus, Codex, Gemini and the other closed source models. We have reached a major turning point. To me, the only closed source model still in the game is Codex, as it is much faster at executing simple tasks and implementing already-created plans.
Try GLM5v for your PDF work; it's their latest-generation vision model, released a couple of days ago.
>For AI computing, the Atlas 950 SuperPoD, powered by UnifiedBus, integrates 64 NPUs per cabinet and can scale up to 8,192 NPUs, delivering superior performance for large-scale AI training and high-concurrency inference.
Codex and GLM didn't have any issue following the exact same plan and getting a working app, so I would argue Gemini is the failure here.
"It couldn't even debug some moderately complicated python scripts reliably."
What wild claim to make. Unsupported by benchmarks, unsupported by the consensus of the community, no evidence provided.
Sounds like in another comment here even the GLM5 team concedes they are behind the frontier wrt tool calling, do you know something they don’t?
My only goal is to encourage people to try it out so they can see if it moves the needle for them, because there are fair chances that it will. I am not trying to start a flamewar or something.
You’re making a claim, and I’m pointing out that it’s unsubstantiated and not consistent with any other source of data, including that internal to the company that makes the model.
I hope you can see that that’s different than saying it’s worked well for me
I do not think that anyone who read my comment understood it differently. But I grant you this point, this is just my opinion based on my personal experience not the result of a scientific study.
That said, I wasn't submitting a scientific paper for preprint, just posting my opinion on an internet forum.
Not sure why you are making such a big deal out of it, especially for something for which people can decide within minutes if it works for them or not. And I haven't seen you nitpick on other people saying that all Chinese models are garbage incapable of doing even the most basic task, without quoting any study. This kind of scrutiny tends to be one-sided.
Edit: and regarding what the z.ai team is saying about their models, just check their Discord and the articles they link there. They themselves say that their latest models have leading performance on a number of aspects. It is misleading to suggest that the authors of the model are not proudly saying that their models have best in class performance.
https://huggingface.co/trohrbaugh/gemma-4-31b-it-heretic-ara...
which was produced immediately after Google released their new Gemma 4 model.
I had no such trouble with 4.7 and find it fast and productive. Haven't tried 5.1; am using OpenAI models for coding most of the time.
Z.ai seem to promote 4.7 for smaller tasks, 5.1 for larger tasks (similar to Anthropic's recommendation for usage of Haiku and Sonnet/Opus models).
5.1 works for me already in the most economical basic paid tier ("lite coding plan"), unlike first release of v5 (5.0 ?)
There are no such models, depending on your definition of censorship. If you're referring to abliteration and similar automated techniques, they're snake oil.
Meanwhile we're even seeing emerging 'engram' and 'inner-layer embedding parameters' techniques where the possibility of SSD offload is planned for in advance when developing the architecture.
With LLMs it feels more like the old punchcards, though.
For me, Opus 4.6 isn't working quite right currently, and I often use GLM 5.1 instead. I'd prefer to use peak Opus over GLM 5.1, but GLM 5.1 is an adequate fallback. It's incredible how good open-weight models have gotten.
I have a feeling it's nearing Opus 4.5 level, if they could fix it going crazy after about 100k tokens.
From my testing it was ok until 145k tokens, the largest context I had before switching to a new session. I think Z.ai officially said it should be good until 200k tokens.
Using it in Open Code is compacting the context automatically when it gets too large.
(1) OpenAI & Anthropic are absolutely cooked; it's obvious they have no moat
(2) Local/private inference is the future of AI
(3) There's *still* no killer product yet (so get to work!)

1) OpenAI and Anthropic are killing it, and continue to do so; their coding tools are unmatched for professionals.
2) Local models don't hold a candle to SOTA models and there's nothing on the horizon that indicates that consumers will be able to run anything close to what you can get in a data center.
3) Coding is a killer product; OpenAI and Anthropic are raking in the cash. The top 3 apps in the app store are AI. Everyone who knows anything is using AI, every day, across the economy.
On (2), I agree with you for local models. BUT, there are also the open source Chinese models accessible via open-router. Your argument ("don't hold a candle to SOTA models") does not hold if the comparison is between those.
On (1), I agree more with the grandparent than with your assessment. Yes, OpenAI and Anthropic are killing it for now, but the time horizon is very short. I use codex and claude daily, but it's also clear to me that open source is catching up quickly, both w.r.t. the models and the agentic harnesses.
Nowadays I also feel model performance matters less than the design of the tool harness, inference speed, and the other systems that surround a typical coding model.
I thought so myself, but after burning a lot of money on OpenRouter in a few days I just subscribed to Z.ai's Coding Pro plan and using the subscription is much, much friendlier with my wallet.
And? They aren't as good as SOTA models. Even the SOTA model provider's small models aren't worth using for many of my coding tasks.
(1): You don't have to be an Ed Zitron disciple to infer that OpenAI and Anthropic are likely overvalued and that Nvidia is selling everyone shovels in a gold rush. AI is a game-changing technology, but a shitty chat interface does not a company make. OpenAI and Anthropic need to recoup the astronomical costs of training these models. Models that are now being distilled[1] and are quickly becoming commoditized. (And frankly, models that were trained by torrenting copyrighted data[2], anyway.) Many have been calling this out for years: the model cannot be your product. And to be clear, OpenAI/Anthropic most definitely know this: that's why they've been acqui-hiring like crazy, trying to find that one team that will make the thing.
(2): Token prices are significantly subsidized and anyone that does any serious work with AI can tell you this. Go use an almost-SOTA model (a big Deepseek or Qwen model) offered by many bare-metal providers and you'll see what "true" token prices should look like. The end-state here is likely some models running locally and some running in the cloud. But the current state of OpenClaw token-vomit on top of Claude is fiscally untenable (in fact, this is why Anthropic shut it down).
(3): This is typical Dropbox HN snark[3], of which I am also often guilty. I really don't think AI coding is a killer product, and this take seems very myopic: engineers are an extreme minority. Imo, the closest we've seen to something revolutionary is OpenClaw, but it's janky, hard to set up, full of vulnerabilities, and you need to buy a separate computer. But there's certainly a spark there. (And that's personally the vertical I'm focusing on.)
[1] https://www.anthropic.com/news/detecting-and-preventing-dist...
[2] https://media.npr.org/assets/artslife/arts/2025/complaint.pd...
Anthropic is up to $30B annual recurring revenue. I wish I had failing business models like that.
> Token prices are significantly subsidized and anyone that does any serious work with AI can tell you this. Go use an almost-SOTA model (a big Deepseek or Qwen model) offered by many bare-metal providers and you'll see what "true" token prices should look like.
I'm not sure what you are saying here, but if you look at the providers for an "almost-SOTA model (a big Deepseek or Qwen model)" or at the price for Claude on AWS Bedrock, Azure or GCP, you will quickly see inference is very profitable.
And profit? A company can have $300B annual revenue, and still be a failing business if it's making a loss.
Somewhere along the line we seem to have forgotten this basic fact. Eventually there will be no more rounds of funding to feed the fire.
Even if you say we are going to measure profit in the very special Hacker News way of weighing money taken in from customer revenue against money invested, and we say they can't do things like counting building data centers or buying GPUs as capital expenses and instead have to count them against profit, then in two years' time they will have made more money than they have taken in investment.
That is extraordinary.
> If every year we predict exactly what the demand is going to be, we’ll be profitable every year. Because spending 50% of your compute on research, roughly, plus a gross margin that’s higher than 50% and correct demand prediction leads to profit. That’s the profitable business model that I think is kind of there, but obscured by these building ahead and prediction errors.
(a lot more at the link)
https://www.dwarkesh.com/p/dario-amodei-2?open=false#%C2%A70...
[1] https://fortune.com/2025/01/07/sam-altman-openai-chatgpt-pro...
Qwen3.5-122B-A10B is $0.26 input, $2.08 output. Where's the subsidy? It's ten times cheaper than Opus. Or did you mean that we're subsidizing their training? But then "OpenClaw token-vomit on top of Claude is fiscally untenable" makes no sense.
Yeah, I don't know where you got your costs from. Bare metal providers are significantly cheaper than Anthropic.
GPU and RAM prices have definitely not made consumer PC's cheaper than they were before bitcoin blew up or before AI blew up.
Maybe you could make an argument that they are more cost efficient for the price point... But that's not the same as cheaper when every application or program is poorly optimized. For example why would a browser take up more than a GB or two of RAM?
And I'd postulate that R&D to develop localized AI is another example; the big players seem hellbent that there needs to be a moat and that it's data centers... the absolute opposite of optimization.
We've had RAM shocks before. We nerds can't control the Wall Street or Virginians who like to break the world every so often for the lulz. However, a wobble on the curve doesn't change the curve's destination.
Landing a man on the moon is way more impressive. Finding several vaccines for a once-in-a-century pandemic within a year of its outbreak is an achievement whose impact and importance dwarf what the entire LLM industry put together has achieved. The near-complete eradication of polio: once again, way more important and impactful.
I'd like to think the superior product wins. But Windows still thrives despite widespread Linux availability. I think sometimes we can underestimate the resilience of the tech oligopolies, particularly when they're VC-funded.
If I want to switch from Windows to Linux, I have to reconsider a whole variety of applications, learn a different UX, migrate data, all sorts of annoyances.
When I switch between Codex and Claude Code, there is literally no difference in how I interact with them. They and a number of other competitors are drop in replacements for each other.
That's because by most metrics Linux is inferior to Windows.
GLM 5.1 has 754B parameters tho. And you still need RAM for context too. You'll want much more than 96GB ram.
I can totally see the same happening here; on-device LLMs are a toy, and then they eat the world and everyone has their own personal LLM running on their own device and the cloud LLMs are a niche used by large institutions.
I can easily see the advantage, even now, of running the LLM locally. As others have said in this topic. I think it'll happen.
edit: thanks for clarifying :)
That's a valuable guarantee. So valuable, in fact, that you won't get it from Anthropic, OpenAI, or Google at any price.
Second answer: ask an AI, but prices have risen dramatically since their training cutoff, so be sure to get them to check current prices.
Third answer: I'm not an expert by a long shot, but I like building my own PCs. If I were to upgrade, I would buy one of these:
Framework desktop with 128gb for $3k or mainboard-only for $2700 (could just swap it into my gaming PC.) Or any other Strix Halo (ryzen AI 385 and above) mini PC with 64/96/128gb; more is better of course. Most integrated GPUs are constrained by memory bandwidth. Strix Halo has a wider memory bus and so it's a good way to get lots of high-bandwidth shared system/video RAM for relatively cheap. 380=40%; 385=80%; 395=100% GPU power.
I was also considering doing a much hackier build with 2x Tesla P100s (16GB HBM2, about $90 each) in a Precision 5820 (cheap, with lots of space and power for GPUs). Total about $500 for 32GB HBM2 + 32GB system RAM, but it's all 10-year-old used parts, you need to DIY a fan setup for the GPUs, and software support is very spotty. Definitely a tinker project; here there be dragons.
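On the "constrained by memory bandwidth" point, here's a rough back-of-envelope sketch of why that wide bus matters for decode speed: a memory-bound machine reads roughly the model's active weights once per generated token. All numbers below are loose assumptions for illustration, not measured specs.

```python
# Estimate decode tokens/sec on memory-bandwidth-bound hardware.
# Approximation: each token requires reading every active parameter once.

def est_tokens_per_sec(bandwidth_gb_s, active_params_b, bytes_per_weight):
    bytes_per_token = active_params_b * 1e9 * bytes_per_weight
    return bandwidth_gb_s * 1e9 / bytes_per_token

# e.g. a Strix Halo-class machine at ~256 GB/s running a MoE model with
# ~10B active params quantized to ~0.5 bytes/weight (4-bit)
print(round(est_tokens_per_sec(256, 10, 0.5), 1))  # 51.2
```

This is an upper bound (it ignores compute, KV cache reads, and overhead), but it shows why MoE models with few active parameters are the ones that feel usable on this class of hardware.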
I run qwen 122b with Claude Code and nanoclaw. It's pretty decent, but this stuff is nowhere near prime-time ready; still, it's super fun to tinker with. I have to keep updating drivers, and I see speed increases and stability being worked on. I can even run much larger models with llama.cpp (--fit on), like qwen 397b, and I suppose any larger model like GLM; it's slow but smart.
For a hobby/enthusiast product, and even for some useful local tasks, MoE models run fine on gaming PCs or even older midrange PCs. For dedicated AI hardware I was thinking of Strix Halo - with 128gb is currently $2-3k. None of this will replace a Claude subscription.
1) What are you going to use that for? A 0.6 model gives you, at most, what you could get from Siri when it first launched, unless you do some tuning.
2) Pretty clear that they are talking about GLM-5.1 4-bit quant.
We're probably talking about a year of progress difference.
It's also still quite expensive for an average person to consume any of it, whether due to hardware investment, energy cost or API cost.
Also, professionally, I don't think anyone will spend a little bit less money to run the third-best model if they can run the best model.
I'm happy that we've reached a level where this becomes an alternative if you value openness and control, though.
(2) is probably true but with caveats. Top-tier models will never run on desktop machines, but companies should (and do) host their own models. The future is open-weight though, that much is for sure.
(3) This is so ignorant that others have already responded to it. Look outside of your own bubble, please.
Sorry, but you don't know that
Every time I asked a question it generated an interactive geometry graph on the fly in Javascript. Sometimes it spent minutes compiling and testing code on the server so it could make sure it was correct. I was really impressed.
Anyway I couldn't really learn anything since when the code didn't work I wasn't sure if I had ported it wrong or the AI did it wrong, so I ended up learning how to calculate SDF and pixel to hex grid from tutorials I found on google instead.
I think big corporations will continue to use them no matter how cheap and good other models are. There's a saying: nobody was fired for buying IBM.
Mid-sized models like gpt-oss, minimax and qwen3.5 122b are around 6%, and gemma4 31b around 7% (but much slower).
I haven’t tried Opus or ChatGPT due to high costs on openrouter for this application.
My use cases are not code editing or authoring related, but when it comes to understanding a codebase and its docs to help stakeholders write tasks or understand systems, it has always outperformed American models at roughly half the price.
It's a fun way to quantify the real-world performance between models that's more practical and actionable.
Overeager, but I was really really impressed.
xkcd was prescient once again... https://xkcd.com/416/
I think the model is now tuned more towards agentic use/coding than general intelligence.
[0]: https://aibenchy.com/compare/z-ai-glm-5-medium/z-ai-glm-5-1-...
And Opus is absolutely terrible at guessing how many tokens it's used. Having that as a number that the model can access itself would be a real boon.
So, it has been convenient to not have hard stops / allow for extra but I still try to /clear at an actual 25% of the 1M anyhow.
This is in contrast to my use of the 1M opus model this past fall over the API, which seemed to perform more steadily.
Claude Opus at 150K context starts getting dumber and dumber.
Claude Opus at 200K+ is mentally retarded. Abandon hope and start wrapping up the session.
If you want quality you still have to compact or start new contexts often.
For around a month the limit seemed to be a little over 60k! I was despondent!!
What's worse is that when it launched it was stable across the context window. My (wild) guess is that the model is stable but z.ai is doing something wonky with infrastructure: that they are trying to move from one context window to another, or have some KV cache issues or some such, and it doesn't really work. If you fork or cancel in OpenCode there's a chance you see the issue much earlier, which feels like another hint about KV caching; maybe it doesn't port well between differently shaped systems.
More maliciously minded: this artificial limit also gives them a way to dial in system load. Simply not delivering the context window the model supports reduces the work they have to host.
But to the question: yes, compaction is absolutely required. The AI can't even speak; it's just a jumbled stream of words and punctuation once this hits. Is manual compaction required? One could build this into the harness, so no; it's a limitation of our tooling that it doesn't work around the stated context window being (effectively) a lie.
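To sketch what building this into a harness could look like (purely illustrative, not OpenCode's or anyone's actual implementation; the thresholds and message shapes are made up): compact at a conservative fraction of the advertised window, since the usable window may be much smaller than the stated one.

```python
# Harness-side guard: compact well before the host's effective ceiling.

def should_compact(used_tokens, advertised_window, usable_fraction=0.5):
    """Trigger compaction at a conservative fraction of the advertised window."""
    return used_tokens >= advertised_window * usable_fraction

def compact(messages, keep_last=4):
    """Replace older turns with a single summary stub, keeping recent turns."""
    if len(messages) <= keep_last:
        return messages
    summary = {"role": "system",
               "content": f"[summary of {len(messages) - keep_last} earlier turns]"}
    return [summary] + messages[-keep_last:]

msgs = [{"role": "user", "content": f"turn {i}"} for i in range(10)]
if should_compact(used_tokens=110_000, advertised_window=200_000):
    msgs = compact(msgs)
print(len(msgs))  # 5: one summary stub + the last four turns
```

In a real harness the summary stub would be produced by asking the model itself to summarize the dropped turns, and `usable_fraction` would be tuned per host.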
I'd really like to see this improved! At least it's not 60-65k anymore; those were soul crushing weeks, where I felt like my treasured celebrated joyful z.ai plan was now near worthless.
There's a thread https://news.ycombinator.com/item?id=47678279 , and I have more extensive history / comments on what I've seen there.
The question is: will this reproduce on other hosts, now that glm-5.1 is released? I expect the issue is going to be z.ai specific, given what I've seen (200k works -> 60k -> 100k context windows working on glm-5.1).
During off-peak hours a simple 3-line CSS change took over 50 minutes, and it routinely times out mid-tool and leaves dangling XML and tool uses everywhere, overwriting files badly or patching duplicate lines into files.
Starting an hour or two ago GLM's API endpoint is failing 7/8 times for me, my editor is retrying every request with backoff over a dozen times before it succeeds and wildly simple changes are taking over 30 minutes per step.
But it's all casual side projects.
Edit: I often /compact at around 100,000 tokens or switch to a new session. Maybe that is why.
For the price this is a pretty damn impressive model.
Providers like DeepInfra are already giving access to 5.1 https://deepinfra.com/zai-org/GLM-5.1
$1.40 in $4.40 out $0.26 cached
/ 1M tokens
That's more expensive than other models, but not terrible, and will go down over time, and is far far cheaper than Opus or Sonnet or GPT.
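For a sense of scale at those rates (the per-1M-token prices are the ones quoted above; the token counts are invented for illustration):

```python
# Cost check at $1.40/M input, $4.40/M output, $0.26/M cached input.

def session_cost(input_toks, output_toks, cached_toks=0,
                 p_in=1.40, p_out=4.40, p_cached=0.26):
    return ((input_toks - cached_toks) * p_in
            + cached_toks * p_cached
            + output_toks * p_out) / 1e6

# e.g. a heavy agentic session: 5M input (4M of it cache hits), 500k output
print(round(session_cost(5_000_000, 500_000, cached_toks=4_000_000), 2))  # 4.64
```

The cache-hit discount dominates for agentic use, where the same growing context is re-sent on every tool call.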
I haven't had any bad luck with DeepInfra in particular with quantization or rate limiting. But I've only heard bad things about people who used z.ai directly.
Devil's advocate: why shouldn't they do it if OpenAI, Anthropic and Google get away with playing this game?
> "build a Linux-style desktop environment as a web application"
They claim "50 applications from scratch", but "Browser" and a bunch of the other apps are likely all <iframe> elements. We all know that building a spec-compliant browser alone is a herculean task.
Would it succeed? Probably not, but it would be way more interesting, even if it didn't work.
I find things like Claude's C compiler way more interesting, where, even though CCC is objectively bad (the code is messy, it generates very bad unoptimized code, etc.), it at least is something cool and shows that with some human guidance it could generate something even better.
"I am the storm that is approaching, provoking..." : )
Excited to test this.
Being "better than Opus 4.6" is not really something a benchmark will tell you. It's much more a consensus of users liking the flavor of an answer than scoring x% correct on a benchmark.
Everyone else isn't that far behind and they aren't all gonna just wall off their new model.
A reason that Anthropic will eventually give is 'the competition can do what Glasswing can do so what's the point limiting it'.
I've spent probably over 100 hours working on this benchmarking/site platform, and all tests are manually written. To me (and the many others who reached out to me), they are not useless either. I use this myself regularly when choosing and comparing new models. I honestly believe it is providing value to the conversation.
Let me know if you know of a better platform you can use to compare models, I built this one because I didn't find any with good enough UX.
https://aibenchy.com/compare/anthropic-claude-sonnet-4-6-med...
https://aibenchy.com/compare/anthropic-claude-opus-4-6-mediu...
Who knew Anthropic was this far behind???
For short-term bugfixing and tweaks though, it does about what I'd expect from Sonnet for a pretty low price.
https://github.com/Opencode-DCP/opencode-dynamic-context-pru...
Since the entire purpose, focus and motivation of this model seems to have been "coherency over longer contexts", doesn't that issue make it not an OK model? It's bad at the thing it's supposed to be good at, no?
It does devolve into gibberish at long context (~120k+ tokens by my estimation but I haven't properly measured), but this is still by far the best bang-for-buck value model I have used for coding.
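One could pin that "~120k by my estimation" figure down more rigorously by binary-searching the padded-context length at which replies stop being coherent. A sketch, where `probe` is a stand-in for a real API call plus a coherence check (e.g. does the reply parse as sentences); the 120k breakdown in the stub is just the estimate above, not measured data:

```python
def find_breakdown(probe, lo=10_000, hi=200_000, tol=5_000):
    """Smallest context length (within tol) at which probe() reports incoherence."""
    while hi - lo > tol:
        mid = (lo + hi) // 2
        if probe(mid):          # still coherent at this length
            lo = mid
        else:                   # gibberish: breakdown is at or below mid
            hi = mid
    return hi

# Stub standing in for a real model call; pretend breakdown is at 120k.
fake_probe = lambda n: n < 120_000
print(find_breakdown(fake_probe))
```

With a real probe this would take ~5 API calls per run, which would also make it easy to track whether the threshold moves over time, as several comments here suspect it does.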
It's a fine model
As Kimi did a huge amount of Claude distillation, it seems to be somewhat based in data.
https://www.anthropic.com/news/detecting-and-preventing-dist...
I'm curious how the bang for buck ratio works in comparison. My initial tests for coding tasks have been positive and I can run it at home. Bigger models I assume are still better on harder tasks.
So I need them to not only not devolve into gibberish, but remain smart enough to be useful at contexts several times longer than that.
I suspect that this isn't the model, but something that z.ai is doing with hosting it. At launch I was elated to find glm-5.1 was stable even as the context window filled all the way up (~200k), whereas glm-5, while it could still talk and think, had forgotten the finer points of tool use to the point where it was making grievous errors as it went (burning gobs of tokens to fix duplicate-code problems).
However, a real brutal change happened sometime in the last two or three months: the parent's problem emerged, and emerged hard, out of nowhere. Worse, for me it seemed to hit around a 60k context window, which was heinous: I was honestly a bit despondent that my z.ai subscription had become so effectively useless, that I could only work on small problems.
Thankfully the coherency barrier rose significantly around three weeks ago. It now seems to lose its mind and emit chaotic non-sentence gibberish around 100k for me. GLM-5 was already getting pretty shaky at that point, so I feel like I at least have some kind of parity. But at least glm-5 was speaking and thinking in real sentences, and I could keep conversing with it somewhat, whereas glm-5.1 seems to go from perfectly level-headed and working fine to total breakdown, a hard switch, at a very predictable context window size.
It seems so, so probable to me that it isn't the model making this happen: it's the hosting. There's some KV cache issue, or they are trying to expand the context window in some way, or to switch from a small-context serving pool to a big-context one, or something infrastructure-wise that falls flat and collapses. Seeing the window so clearly change from 200k to 60k to 100k is both hope and misery.
I've been leaving some breadcrumbs on Bluesky as I go. It's been brutal to see. Especially having tasted a working glm-5.1. I don't super want to pay API rates to someone else, but I fully expect this situation to not reproduce on other hosting, and may well spend the money to try and see. https://bsky.app/profile/jauntywk.bsky.social/post/3mhxep7ek...
All such a shame, because aside from totally going mad and speaking unpunctuated gibberish, glm-5.1 is clearly very, very good and I trust it enormously.
GLM5 also had this issue. When it was free on Openrouter / Kilo the model was rock solid, though it did degrade gracefully after 100k tokens. Same at launch with z.ai, aside from regular timeouts.
Somewhere around early-to-mid March, z.ai did something significant to GLM5 - like KV quanting or model quanting, or both.
After that it's been Russian roulette. Sometimes it works flawlessly, but very often (1/4 or 1/5 of the time) thinking tokens spill into the main context, and if you don't spot it happening it can do real damage: heavily corrupting files, deleting whole directories.
You can see the pain by visiting the z.ai Discord - filled with reports of the issue, yet radio silence from z.ai.
Tellingly, despite the model being open source, not a single provider will sell you access to it at anything approaching the plans z.ai offers. The numbers just don't work, so your choice is either to pay significantly more per token and get reliability, or to put up with the bait and switch.
The bar is very low :(
But I used 70M tokens yesterday on glm-5.1 (thanks, GLM, for having good observability of your token usage, unlike OpenAI; dunno about Anthropic). And I got incredible, beautiful results that I really trust. It's done amazing work.
This limitation feels very shady and artificial to me, and I don't love it, but I also feel like I'm working somewhat effectively within the constraints. It does put a huge damper on people running more autonomous agentic systems, unless they have Pi or other systems that can self-adaptively improve the harness.
[[you guys, please don't post like this to HN - it will just irritate the community and get you flamed]]
Interesting.
Hopefully these aren't bots created by Z.AI because GLM doesn't need fake engagement.
Thanks for watching out for the quality of HN...
These are different from the submitter-passed-a-link-to-friends kind of upvoting and booster comments, which feel quaint by comparison. In this case people usually don't know they are breaking HN's rules, which is why they don't try to hide it.
There are YC members in the current batch who are spamming us right now [2]. They are all obvious engagement-bait questions which are conveniently answered with references to the SaaS.
[0]: https://www.reddit.com/r/DoneDirtCheap/comments/1n5gubz/get_...
[1]: https://www.reddit.com/r/AIJobs/comments/1oxjfjs/hiring_paid...
[2]: https://www.reddit.com/r/androiddev/comments/1sdyijs/no_code...