This weekend I programmed a matrix bot with encryption and a Rust agent with some tools. Because I need one and OpenClaw just felt... not what I wanted. Two days later and 20 dollars poorer I have what I need: a multimodal agent written in rust that has access to my homelab.
Nothing felt off with GLM. It did what I wanted, was fast, had a decent not very annoying personality and was much cheaper than Opus or GPT.
I used it unquantized through Fireworks, but there are multiple other providers too.
In multi-agent coding environments, GLM 5.2 is just shy of Opus 4.6 on average. Data at https://gertlabs.com/rankings
But when factoring in performance/cost, GLM 5.2 is the frontier model.
I'm trying to wrap my head around exactly why so many people seem to want the best model available when it has recently become clear that most halfway decent models can write damn good code for a fraction of the price. And the frontier models get nerfed constantly, so with open weights you can get something slightly less performant but way more stable. Almost like buying a Ferrari for your daily commute instead of a Toyota or even a Mercedes.
I think there are several factors. Certainly marketing making us think we need the shiny thing, which is rampant online and which very smart people think they aren't susceptible to. There's a lot of really odd 'I trust Anthropic/OpenAI more than Deepseek', which tends to ignore, for starters, that you can choose your provider and still save a ton. I also think there's some amount of addiction and brand loyalty, where a Ferrari is one hell of a drive so you turn your nose up at that sensible Toyota. Oh, the other one I see used is like 'oh, only Fable can one-shot updating my embedded systems thing from 1975 to Rust', which is great, but let's recognize how niche that is.
And it ends up just coming across as people are getting SO reliant on the tools so fast. Maybe it's ok to think and like read a few lines of code and work with these agents to convert your thing to rust or center your div. Even if coding is over which in some sense it certainly is, don't turn your mind into the wall-e people yet. I found myself guilty of this so often. It takes way more time and effort to do things via prompt and I wouldn't just open the editor and fix it because that dopamine hit of the magic the abstraction provided was so strong.
So I'm pretty much done using the 'best' (on benchmarks, if money isn't an object, etc etc) models available. After a year on Sonnet/Opus/GPT5x I'm having way better results with open weights models that don't get lobotomized weekly. I'm finding ways to do the crafting part of building software by focusing on honing my harness and workflow. I'm enjoying changing the oil on my Toyota after a year of almost flying off cliffs in my Ferrari and if I can check my ego it's a purely positive thing.
Because even the best model available is not good enough for what we really want it to do once you start digging. The goal is not using the best model. The goal is using the least insidiously bad model.
> when it has recently become clear that most halfway decent models can write damn good code for a fraction of the price.
The starry-eyed falseness of this pretty quickly becomes apparent if you spend more time looking at it, IMO.
This is the logical end point of the fear-based way LLMs are marketed. You must want the best, because everyone who has the best can work faster than you, generate more — if you don't have the best, you are behind! Why would you want to use anything other than the best?
The thing is, once everyone has the best, the question is: how much can you spend? If you can't spend more, you are behind! If spending the most will get you ahead, why would you not want to spend the most, if you can afford it?
There is only one way through this, in the long run: work out a way forward that doesn't make you dependent on this cycle. If you can compete at all, without the spend, what happens is: they burn money and you don't.
FWIW so far I don't think the benchmarks prove very much about the actual experience, and you can discover this just as easily without spending any money. And we know this about benchmarks! Once a benchmark seems useful as a measurement, it becomes a target and it stops being as useful.
I think your strategy is right. It requires bravery, and as you say, it requires ego balance. But I believe it is obvious that the world will either come around to a more sensible, stable pattern or it doesn't matter either way because we're fucked. So opting out of this mad early cycle and choosing to be calmer and happier is a choice you can just make.
The difference is how the model is used.
With Opus you can give it a long-horizon task (eg build an entire feature) and it will plan it out and implement it and almost always stay on task. This is what people mean when they say "agentic tasks"
With the lesser models the code is fine, but they need something else to plan what needs to be done.
GLM-5.2 is the third model (after Opus 4.6+ and GPT-5.5) that can do this agentic style work.
Notably Gemini 3.1 Pro is notoriously bad at this style work - the code is good, but it drifts off task most of the time. 3.5 Flash is supposed to address this, but I haven't had a good reason to try it.
You are free to do you. But you were asking about why others want the best model.
The answer is, clearly, agentic coding (ie multiple agents each cranking through tasks independently) lets you ship A LOT more business value if used correctly.
And hey, don't get me wrong, you can get pretty far with just prompting. But the subtle misses and (I'm looking at you GPT) the overengineered 20k line PRs to do a simple thing are going to cost you a lot if you're not vigilant.
I don't think anyone is stopping you. This is an entirely valid way of working.
I for one am glad to leave that behind me. The sooner I never have to write another line of code the better (professional software engineer for nearly 30 years here, for context).
I am still struggling with how to deal with sub agents and different roles for each model. I still think Claude or Codex are overall better models, but everything around them gives off such weird vibes, including, and this one kills me, that at certain times they feel dumbed down.
I keep changing these things often, but I have a basic subscription to codex ($20 plan) which I use with GLM 5.2 to do some high level planning of what I intend to do, and then let Deepseek do the coding. Or something along those lines.
Point is, GLM 5.2 is now at a point where I cannot tell you if it's better or worse. I can tell you however one thing: no matter when I use it, it's consistent in what it does and how it works.
Then there is the Fable thing, but as with many things, I think the past has distorted the reality. It lasted two days, but Anthropic said it clearly for plan users it would only be there for two weeks. It was great for doing what you can already do with other tools: doing all the planning, and reviews, and launching a million subagents talking to each other. I sometimes wonder if it was really a new model, or just Opus 4.9 wrapped with some fancy model driven harness.
As for Fable: I used it as much as I could while we had it.
It was a step change over Opus with my work.
I've had no trouble getting the current generation of smaller models to do the same thing. Maybe it's more of a harness issue than a model issue?
Recently I've used both MiniMax M3 and DeepSeek V4 Flash to one-shot moderately complex applications from a written spec, and neither one got lost along the way
Price and speed, for me. GLM5.2 is "good enough" for some tasks, but rather slow (on their coding plan). In the time it takes GLM to "read files to figure out...", gemini flash is usually finished. It's not SotA for coding, but it's fast and often "good enough" for normal tasks.
For Flash 3.5?
I'm a big fan of Gemini 3.1 Flash Lite Preview (yes that is the name..).
I keep an agentic SQL benchmark up to date to test new models. It's more-or-less saturated above 23/25 but below that is still useful, and even at that level is good for comparing speed, cost and token efficiency.
3.1 Flash Lite Preview scores 22/25 in 142 seconds for $0.02. That's a great result if you care about cost for performance.
3.5 Flash scores 20/25 in 367 seconds for $0.76. The slow speed is because it takes a lot of tokens to generate its results, so even if tokens are produced quickly it takes too many to get a positive result.
There's nothing I've seen or heard that indicates 3.5 Flash is better than this indicates.
https://sql-benchmark.nicklothian.com/?highlight=google_gemi.... vs https://sql-benchmark.nicklothian.com/?highlight=google_gemi... (click the cells to see the traces)
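To put the cost-for-performance point in numbers, here is a small sketch that normalizes the two results quoted above to cost and time per correct answer. The scores, times and prices are the ones from this comment; the `Run` class and its field names are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    name: str
    correct: int      # questions answered correctly (out of 25)
    seconds: float    # wall-clock time for the whole run
    dollars: float    # total cost of the whole run

    @property
    def dollars_per_correct(self) -> float:
        return self.dollars / self.correct

    @property
    def seconds_per_correct(self) -> float:
        return self.seconds / self.correct


runs = [
    Run("3.1 Flash Lite Preview", correct=22, seconds=142, dollars=0.02),
    Run("3.5 Flash", correct=20, seconds=367, dollars=0.76),
]

for run in runs:
    print(f"{run.name}: ${run.dollars_per_correct:.4f} and "
          f"{run.seconds_per_correct:.1f}s per correct answer")

# How many times more 3.5 Flash costs per correct answer.
cost_ratio = runs[1].dollars_per_correct / runs[0].dollars_per_correct
```

On these figures the per-correct-answer cost differs by a factor of roughly 40, which is a much bigger gap than the 22 vs 20 score suggests.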
I used to be obsessed with what's the best model. Then a while back when the new best model came out, I tested it on a task. I also tested its little brother (much smaller model from same company).
They both completed the task perfectly except the "best" model (the bigger one) cost 5x more and took 3x longer...
With the wealth of models available (open source vs closed, api vs local), I find optimizing the cost-efficiency of your token consumption an important part of business-oriented AI engineering. You don't need "the best" for every task.
Same for me, I certainly don't have the same definition of success and failure either.
A more expensive model has *less* room for wandering around than a cheaper model.
If Claude wanders around for 10 minutes before finding the most obvious solution, then I count it as a failure.
Meanwhile, it's such a cheap model that I've spent not even $25 over 3 weeks.
This also allows me to play with, and mix codex, claude cli, and others. This is my happy spot for the last two months.
In my case, I rarely ever go over the Claude/ChatGPT subscription limits, so might as well use those considered-best models. If I had to generate millions of lines of code, maybe I would've used the open models more.
One group is trying to get the LLM to basically one shot everything and not properly reviewing the output.
Others are using the LLM to assist their human intelligence in a tight loop.
If you’re doing the former you really do need the best model available because that’s still right on the edge of what LLMs can do at best, and at worst you’re just shipping pure unmaintainable slop.
If you’re doing the latter then you can get away with a slightly less powerful model without it making a material difference because your human intelligence is filling in gaps
I generally start by reviewing, but after a while (a few hours at most) I just can't keep up and resort to LLMs as sole reviewers.
While the single functions/classes/structs/... can be well thought out, the code tends to lack cohesion, and especially maintainability. For instance, it never thinks: "I could put this logic in an interface/trait so that if the requirements change I can simply add a concrete implementation that satisfies the new requirements (and potentially use one of these for testing)".
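A minimal sketch of the kind of seam meant here, using Python's `typing.Protocol` as a stand-in for an interface/trait. The names (`RateSource`, `FixedRate`, `convert`) and the currency example are hypothetical, chosen only to show the shape.

```python
from typing import Protocol


class RateSource(Protocol):
    """Anything that can give an exchange rate between two currencies."""

    def rate(self, src: str, dst: str) -> float: ...


class FixedRate:
    """Concrete implementation backed by a fixed table; also handy in tests."""

    def __init__(self, table: dict[tuple[str, str], float]) -> None:
        self._table = table

    def rate(self, src: str, dst: str) -> float:
        return self._table[(src, dst)]


def convert(amount: float, src: str, dst: str, rates: RateSource) -> float:
    # The logic depends only on the interface, so a new requirement
    # (say, live rates from an API) means adding a class, not editing this.
    return amount * rates.rate(src, dst)


total = convert(10.0, "EUR", "USD", FixedRate({("EUR", "USD"): 1.5}))
```

The same `FixedRate` class doubles as the test double, which is the "potentially use one of these for testing" part.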
SoTA models can do reasonably good jobs on each ticket, but over time the architecture of the application starts degrading without a human in the loop.
The entropy increases slower with better models but the trend is always towards slop
It's not fast-changing, it's not abstract, it's just not that difficult, and where it is difficult, the AI cannot help you, because it is not capable of things you are capable of.
Learn CAD yourself. Honestly; I was sure I would never manage to learn CAD but it turns out to be interesting, rewarding, valuable and actually quite quick to learn.
An LLM certainly is not going to be able to do it better than you once you have a tiny bit of experience. (PCB design, perhaps, has a language to it that an LLM can make a bit more headway into, but as a non-PCB-designer I would still bet that it's more like CAD than code)
It has been hard to explain that they are in fact just creating toy versions and there is no way they can do it without learning the underlying architecture. But they just keep going, wasting 100s of dollars, lost in a sea of bugs
Dabbled with OpenSCAD as well. I decided to learn FreeCAD and what I discovered is that, even putting aside FreeCAD's many documented issues, parametric GUI CAD is not an imprecise, clumsy or fiddly way to work.
It is expressive, precise, generally capable of all the things that code-CAD can do and much more, and it's much, much quicker to work in, once you've learned a few core principles.
As you say, there is an underlying architecture; it's not just a sort of 3D paint package.
The problems the text-as-whatever crowd have are all Dunning-Kruger things in the truest sense.
People who are unaware they are unskilled in a particular technology are unlikely to successfully replace it with another. Particularly one that requires describing the problem domain in precise language.
Quite often when you see text-to-CAD discussions, especially here, there's evidence of profound misunderstandings from the people who think they are going to automate it. They assume their frustrations with the tools stem from limitations of the tools, not from the limits of their understanding.
As a person with decades of experience of code I have found learning how to use LLMs effectively to be much, much harder than learning CAD.
I've been programming since I was a kid. I enjoy it a lot, I like knowing how things work, I get excited about new compiler features, I stayed up every night for a week when I discovered Lean 4, etc etc etc.
At the same time I realized a few years ago that I just don't want to write any code ever. Or read any code. Coding is addictive and fun, but I'd rather talk to the computer and have things magically get done. (FWIW learning how to use LLMs feels more.. fulfilling, too)
Anyway. GLM 5.2 is nice and all, but I might have to spend half an hour guiding it to come up with a plan I'm happy with. And with Opus it could be 15 minutes. I'm still going to spend an hour talking to LLMs one way or the other, but with Opus it will be a less frustrating hour. If Fable gives me a frustration-free hour, I'll switch to Fable.
The reason is pretty simple and has to do with statistics: on long-horizon tasks, small errors and deviations from the "good path" compound.
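A rough way to see the compounding, assuming (as a simplification of my own, not something measured) that each step succeeds independently with the same probability `p`, so an `n`-step task succeeds with probability `p ** n`:

```python
def task_success(per_step: float, steps: int) -> float:
    """Chance of getting through `steps` steps with no unrecovered error,
    assuming each step succeeds independently with probability `per_step`."""
    return per_step ** steps


for p in (0.99, 0.999):
    for n in (10, 100, 500):
        print(f"p={p}, steps={n}: {task_success(p, n):.3f}")
```

Under this toy model, the gap between 99% and 99.9% per-step reliability turns into roughly 37% versus 90% task success over 100 steps. Real agents can notice and recover from mistakes, so treat it as an intuition pump rather than a prediction.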
That changes as soon as the developer is the one paying for a model. Then it's a classical engineering trade-off between money and quality, and that's where open models are clear winners.
The problem isn't what they do in a blank state. It is how they get there and the edge cases. Some models also take longer (use more steps), i.e. they end up costing more despite being "cheaper".
I've seen models:
- Back out plans non-stop. Tried the obvious path. Invents X/Y/Z excuse (without verifying) that it can't be done. Notes that down and moves on. It could be as simple as site A being down and to download from site B but that's it.
- Hacks the test to make it work. Code is wrong? Nah, let's update the test.
- Keep saying useless things like YAGNI and coming up with endless excuses like "too risky" to avoid ever doing the work.
- Claims they are done but there's 100 edge cases not covered. When you try to use it it fails in ways you as a human assume it should work. You can write a spec to cover it all but then what's the point?
- Be trigger happy and never investigate. Tries to do it. 5 minutes. Oh it failed. Back out. Repeat. Better models definitely spend more time analyzing and actually "think". I've had models spend hours trying to do a change due to this method when an actual investigation (code walkthrough) might have solved it.
- Fail to know and use the right tools. A lot of lesser models have endless fear, e.g. "oh, docker might not be available" (it is) or this and that (even if you nudge them), and spend a lot of extra time "working around" it.
The list goes on. Better models definitely help.
Only thing to agree on is no you don't need Fable but saying Sonnet can do the job instead of Opus is a different story. It's so obvious when Sonnet touches the code that I can't give it more than 5 minutes. It lies. Doesn't check. Forgets things and then messes up.
That’s not super surprising to me, but, given the apparent randomness of the stack ranking, is GLM actually worse than any of the Anthropic models? This looks like a 10-way tie to me.
We now use Sonnet 4.6 for a number of internal use cases we wouldn't have considered otherwise.
4.7 was so bad, I locked a bunch of my machines to 4.6.
I haven’t bothered locking the 4.8 machines to 4.6. There was a HN thread a while back where they run swe bench a few times a day and measure success rate and latency. It showed opus getting significantly dumber for the week before a recent launch.
It wouldn’t surprise me if they’re quantizing to improve margins or to hype models in comparative testing in order to defraud investors at IPO.
Or, maybe QA is hard. Anyway, I think they hit a performance wall sometime at or before 4.6.
Just want to express how amazing that is. Opus 4.6 is an amazing model. That an open weight model like GLM 5.2 competes with it is nothing short of outstanding.
On the contrary, I personally think these broader benchmarks are meaningless. I think personalized benchmarks are the way to go. They should answer "How does this model perform for MY use-case?" rather than trying to answer "How does this model perform across all coding environments?"
Case in point: I use Elixir, which is not as popular as Python and is always hit or miss with most SOTA models at the top of these benchmarks. Whereas the ones in the middle of the benchmarks (like GLM) almost always outperform even SOTA models from Google / Anthropic. However, this is relevant only for my use case and I wouldn't just advocate a model for everyone based off my use-case alone.
We test 11 popular/interesting languages (you can see the Languages chart to filter), but not Elixir -- although other evaluations have found that many LLMs solve more problems when working with Elixir [0]. Why models write code well in some languages over others seems to go beyond pre-training data (Python scores quite low for most models) and we don't fully understand it.
[0] https://elixirforum.com/t/llm-coding-benchmark-by-language/7...
Sounds like this is the year for coding.
They did this to themselves.
So having a model of 4.6 quality is still extremely awesome. That currently is more or less the frontier reference outside the US :(
Congrats, now you’re paying an engineer’s salary to make your engineer at best 20% more productive.
Better to hire another engineer, or two jrs, and build up your in house talent.
Most of the gains right now come from tooling and process and any big post 2025 language model. The specific model isn’t that important right now.
It doesn't have a higher capability score than Fable, though. We break our coding evaluations into 2 parts, and "one-shot coding" makes up part of the index, where Fable significantly outperforms every other model, which is why it's ranked at the top despite Sonnet 4.6 having a slightly higher median (and lower average) in long-horizon agentic workloads. One-shot coding tends to be the most correlated with other companies' model cards, whereas agentic coding is partly about how well a model can adapt to a custom harness. Fable also refused some tasks.
Data at https://gertlabs.com/rankings?ow=1&mode=oneshot_coding
When I first used Opus 4.8, I threw several different workloads I had at it - I have Claude doing a lot of misc projects whose primary purpose is pretty much just studying what AI agents can do for my own curiosity and no other reason. Opus 4.8 was one of the first models I ever snuck in there that basically ran out of control. No previous Opus or Sonnet model I had used ever did this. Within hours every agent I had running was writing nonsense tool calls that echoed pretend commands that didn't exist, like 10 in a row, and talking about the "tool channel" being dirty. I switched back to Opus 4.7 and assumed Opus 4.8 was legitimately just broken.
I did come back to Opus 4.8 and found that it was indeed, pretty powerful. But that initial experience has stuck with me on just how narrow of a perspective any given test or benchmark is guaranteed to have. LLMs are too broad, it really doesn't matter what you try to do in your benchmark, you will necessarily get a limited view of what the model is capable of and its shortcomings. This will remain true for at least as long as models are susceptible to massive swings in performance based on randomness and minor differences in prompts and other environmental factors.
I'm not saying benchmarks are useless or that your benchmarks are not possibly closer to the truth either. All evidence at least points to the idea that Chinese models perform very well in coding but often have more mixed results on other tasks. I'm just saying that at this point, benchmarks feel like they have limited connection to my actual real experiences. GLM 5.2 actually scored kinda meh on a lot of benchmarks (compared to closed frontier models) but my actual experience using it does not match this.
And I'm definitely not saying GLM 5.2 is better than the frontier LLMs here, just that the race is close. I still prefer GPT 5.5 right now for code review, I think, and Opus clearly has some advantages depending on the task. It's just no longer a given that Opus 4.8 will perform better than GLM 5.2 on any given task, so to me the calculus behind "using the best model available" is getting complex and you might need to get a feel for what models have what strengths to really figure it out.
I do feel like the "use the best model available" mentality is not going to die any time soon, but if it does die, it will be gradual and start soon for programming. Modern LLMs are still not a full superset of what human programmers can do, but still larger models are definitely starting to hit diminishing returns for tasks at the lower end of complexity, and that is a big deal. It's a weird world where some tasks you can feel kinda confident just throwing Gemma 4 at it and not sweating whether you should use a better model; I've certainly done it for some quick Python scripts or getting an overview of some code I'm unfamiliar with.
Some blog post I read a few weeks back said that DSV4Flash in xHigh effort beats even the pro model in xHigh effort.
I think the surprising thing is I expect flash to be a pure distillation and strictly worse quality but clearly it’s more nuanced than that.
https://www.reddit.com/r/DeepSeek/comments/1rd5jw7/claude_so...
Notice he's using "trust me bro" benchmarks.
Can we just remove all the motivated speech on HN? This is just not trustworthy information at all and obviously is incentivized.
Everyone is grinding and marketing; nobody is actually discussing anything for real.
Anthropic won't even let you run "claude -p [prompt]" any more... They bill it at api rates.
So if you're trying to automate the ai (and seriously, that's the point) the subsidized plans are crippled.
> In May, we sent you an email announcing that starting today, the Claude Agent SDK, claude -p, and third-party apps built on the Agent SDK would stop drawing from subscription rate limits and move to a dedicated monthly credit. We're writing to let you know that we’re not making this change today. We’re working to update the plan to better support how users build with Claude subscriptions.
> What this means for you
> Nothing changes for now. Agent SDK, claude -p, and third-party app usage continues to work with your subscription exactly as it did before today, and there's no credit to claim. Your subscription limits are unchanged. When we have an update, we'll share it with advance notice before it takes effect
I see no reason to trust Z.ai more than other vendors.
But there are EU only providers for GLM5.2. For example tensorx. Depending on your definition of "secure" it may be acceptable.
I have not tried it but I will take your word on it. I don't think Qwen3.6 cuts it for large scale coding work. Reading issues, reading code sure, but biting into large issues no, it goes off the track consistently.
Depending on budget it may also be affordable to spin up servers to run it on demand.
For real work anything below 60 tokens per second is essentially unusable. That's not taking into account the prompt filling. Llama 3.1 70B on a DGX Spark runs at about 800 tps; at that speed, prompt filling a 512k context takes like 11 minutes.
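The arithmetic behind that prompt-fill figure, as a quick sketch. It assumes "512k" means 512,000 tokens and that the 800 tps number applies to prompt processing; the function name is just illustrative.

```python
def prefill_minutes(context_tokens: int, prefill_tps: float) -> float:
    """Minutes spent processing the prompt before the first output token."""
    return context_tokens / prefill_tps / 60


minutes = prefill_minutes(512_000, 800)
print(f"{minutes:.1f} minutes")
```

That works out to 640 seconds, a bit under 11 minutes, before a single token of the reply is generated.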
We definitely don't have any intention to obfuscate and in fact we actually try and provide more data than any other provider out there about both an individual request, as well as the fleet behavior. Since we tend to focus directly on our energy pricing and optimizing that the issue is likely where the ROI lies on energy optimization versus token optimization (totally correlated but we have other levers to reduce energy while keeping token counts the same).
So your question is really “if they’re giving free usage, why not take advantage of it?”
I do, so I don’t know the reasons not to, other than to experiment.
I kind of wanted to see if I can make a Matrix agent from scratch with Rust with GLM and it was surprisingly easy. Just make something for myself how I want it. Maybe I'll take a look on Hermes later...
I followed this example
but I'm running into issues with nested backticks so I'm thinking of making dedicated close tags per tool call.
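One way the "dedicated close tag per tool call" idea could look, as a sketch only. The `<tool:NAME> ... </tool:NAME>` format is a made-up assumption, not something from the example being followed; the point is that the closing tag names the tool, so backticks and code fences inside the body can't end the call early.

```python
import re

# Hypothetical wire format: <tool:NAME> ...body... </tool:NAME>
# The (?P=name) backreference requires the close tag to match the open tag.
TOOL_CALL = re.compile(
    r"<tool:(?P<name>\w+)>\n?(?P<body>.*?)\n?</tool:(?P=name)>",
    re.DOTALL,
)


def parse_tool_calls(text: str) -> list[tuple[str, str]]:
    """Return (tool_name, body) pairs in the order they appear."""
    return [(m.group("name"), m.group("body")) for m in TOOL_CALL.finditer(text)]


FENCE = "`" * 3  # a markdown code fence, built here to keep this example readable

reply = (
    "I'll write the file.\n"
    "<tool:write_file>\n"
    "# Notes\n"
    f"{FENCE}sh\n"
    "echo `date`\n"
    f"{FENCE}\n"
    "</tool:write_file>\n"
    "<tool:shell>\nls -la\n</tool:shell>\n"
)

calls = parse_tool_calls(reply)
for name, body in calls:
    print(name, repr(body))
```

A regex like this still breaks if the body itself contains the exact closing tag, so a stricter setup might add a per-call random suffix to the tag name.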
I don't think a $100 session is "typical". I've used GPT for months. The $20/m Plus plan is enough for my daily work.
My typical session ranges from $100-$400 - higher end when using workflows with lots of subagents. $100/session is expected when using the API without the subsidized subscription pricing. Most larger orgs have to use API pricing AFAIK.
There are Business and Enterprise plans, both have discounting.
I'd blow through $20/month plan in hours.
Likewise, I don't understand how anyone survives on the basic plans. It's funny seeing these two camps not understanding what the other is doing :)
In the future none of us do, so it's better to trial how the actually affordable models perform.
I understand the reasons to use team/enterprise accounts, but apart from the policy/management/billing side of it, I still don't understand the value in spending thousands for API instead of hundreds - even when there's argument that one provider is better than another depending on the use case, I don't think that credibly extends much beyond OpenAI + Anthropic frontiers, which both have $200 subs you can stack.
Did you program, or did you give the order to an agent to program?
How are you comfortable spending that much to write something as simple as a matrix bot?
Are people doing this kind of thing just super rich or am I missing something?
Same thing for stuff at work. Oh, the tables/schema changed and my queries broke? I could dork around with spark and cypher for an hour, or I can tell claude to update the queries for the new schema. At the rate I am paid, spending on Claude tokens is generally a better use of my resources.
Building a net new solution? Coding tools take a back seat until I get the core logic right, then I let automation handle web page and UI scaffolding.
https://swelljoe.com/post/will-it-mythos/
Also of note, I found giving models access to the open source semgrep as a tool makes some perform worse and none perform better, though it's plausible there's a way to wire it up in a harness that presents useful information to the model without the model having to know how to use it (my theory is that semgrep isn't heavily represented in the training data, so you're asking the model to do two things at once: figure out how to use semgrep and find security bugs, and both tasks suffer for the lack of focus...most small models, and some big models, can't do that well).
Edit: But, also, more testing is ongoing. I suspect GLM 5.2 will also be a consistently strong performer. It seems to excel at most things I've tested on it.
Overall, I still think GLM 5.2 is the much stronger performer. It's hard to tell the difference between GLM 5.2 and Opus at <120k tokens.
I'm inclined to use DeepSeek V4 Pro the most, because it is consistently extremely strong, it's very fast, it's very cheap and has excellent caching and cheap-as-free cached input tokens (something like 80% of token usage is cached when I'm using it for security scanning). So, my probable "pair" of frontline security researchers will probably be DeepSeek V4 Pro and Gemma 4 31B self-hosted (another shockingly strong contender, competitive with the best models once you let it loop on the same file a couple/few times). But, I won't be surprised if GLM 5.2 turns out better than DeepSeek V4 Pro...it costs quite a bit more.
I've benchmarked it, and the "here's a repo, find bugs" approach finds far fewer bugs. Like, dramatically fewer. Models are good and contexts have expanded, but focus still wins with hard problems. You could probably tell the good models to make a plan to audit the repo, and it would end up making its own "loop" in the form of a checklist of files to look at over several sessions or via subagents, I assume.
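For anyone wanting to try the focused approach, here is a rough sketch of a per-file loop. It is not the actual harness used for the benchmark; `ask_model` is a placeholder for whatever client you use, and the prompt wording and `passes` default are assumptions.

```python
import tempfile
from pathlib import Path
from typing import Callable, Iterable


def audit_per_file(
    files: Iterable[Path],
    ask_model: Callable[[str], str],
    passes: int = 2,
) -> dict[str, list[str]]:
    """Send one focused prompt per file, `passes` times each, and collect replies."""
    findings: dict[str, list[str]] = {}
    for path in files:
        source = path.read_text(errors="replace")
        prompt = f"Audit only this file for security bugs.\n\n# {path}\n{source}"
        # Each pass is an independent call; the model only ever sees one file.
        findings[str(path)] = [ask_model(prompt) for _ in range(passes)]
    return findings


# Demo with a stub standing in for a real model call.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "parser.c"
    target.write_text("int main(void) { return 0; }\n")
    report = audit_per_file(
        [target],
        ask_model=lambda prompt: f"{len(prompt)} chars reviewed",
    )
```

Swapping the stub for a real API call and deduplicating findings across passes is left out on purpose to keep the shape of the loop visible.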
Not sure if helpful, but in my experience, when something a bit more complex needs to be done, manually making it read the context I know the model will need to solve it well (like making it consume all the project docs first) helps with getting a more satisfactory result, instead of only giving it the task and letting it look around and consume the context it thinks it needs.
Will test your bug finding method in a current project of mine both with my "manual" context preloading and without.
Deepseek v4 pro being in preview with less "safety" training makes it stronger for that reason. Thinking will be different and in the end, it will actually try to be useful. Just expect future Chinese LLMs to further push out "safety" guided LLMs. The future is bleak for open weight models. Prepare to have "guidelines" enforced unceremoniously to all.
Fable found a couple of good ones, then we lost Fable, so I tried GLM5.2 and it found two critical bugs that Fable had missed, so it got my seal of approval.
…probably already is one
But, there are benchmarks for so many different kinds of ability, I don't know how to compare them directly against one another. Like, models that do well on terminal and agentic coding benchmarks tend to do well on finding security bugs, but it's not a 1:1 correlation, there are surprises.
I like reading benchmarks, but I take them all with a grain of salt. They're just to tell me if the model is worth even trying for my task. I've heavily used self-hosted Qwen 3.6 and Gemma 4 on a bunch of different tasks, and while the benchmarks consistently say Qwen is the better model, I simply don't find that to be the case for anything I do. I think Qwen is tuned for benchmarks, while Google couldn't give two shits about most of the benchmarks, they're just busy making unusually smart tiny models.
Anyway, it isn't possible for any of the models, so far, to be trained on the Mythos bugs. We're getting closer to the point where I have to worry about that, at which point I'll roll forward and pull some newer CVEs from what they've published, assuming they keep publishing new bugs. (And, if they don't, it's trivial to switch to just random CVEs. But, finding out what Mythos is up to is interesting.)
Thus companies who still try to have humans perform intertwined work with their AI won't see an improvement, while the ones who find the right conditions to give their AI more free rein will see it.
Kind of like it's no use having a workhorse pull a combine harvester: at some point, when machines reach sufficient efficiency, you just give wheels to the harvester and let it run.
I have a feeling Anthropic may still come out cheaper (mainly thanks to enterprises subsidising the Max subscriptions).
But I'm very excited with the possibility of using fully EU based inference rivalling Opus in quality.
A few people are running highly quantized models with limited context windows. It's still impressive, but not the benchmark level intelligence. Very few people could afford a rig for reasonable local performance at a reasonable quant, at full context size.
The antirez example is 2.6bit quant, 32k context, and few tokens per second... on a ~$7000 MacBook M5 (new RAM pricing).
Don't worry though, open source evangelists will tell you that these will be running on your phone in the next 3 years.
For $100k you could run this model 24/7 through OpenRouter with 10 concurrent sessions at 50tps for a decade and have money left over for a vacation. There's no point in investing this type of money in local models unless you have a business where you're already paying for many employees' individual token usage.
8 x RTX6000 GPUs cost $100,000 alone. You then need to build a system that can support those GPUs with enough PCIe lanes through a PCIe switch.
It's going to be $120K to $150K to build or buy a system to run this.
But hey you could save on heating?
A single circuit using 10mm TPS would technically be enough to run what you’re describing. Might be pricey though, I’d probably take the excuse to get 3 phase installed so I could get access to the stock of used 3 phase machinery.
In the US it's common to get 200A 120/240V split-phase service. We're talking about the wiring inside the house, though.
How do you think everyone here is charging their electric cars at home and running our AC and electric cooktops at the same time if we didn't also have that? :)
You need to derate for constant loads here, and I assume you have to do that in NZ as well.
So, no, not a "uniquely US issue".
Or even just electricity costs vs token cost
The real gangstas are running 16x RTX6000s. Too rich for my blood, and the NVFP4 quant doesn't seem to be that much worse.
Not sure if you're being sarcastic, but I can run a quantised version of Gemma or Qwen on my 16GB M1 Macbook Pro that beats GPT-4 from 2023 hands-down.
I wouldn't be surprised if, in another 3 years, you'd be able to run something as powerful as Opus 4.5 or GLM-5.2 on standard consumer hardware - say a 32GB/64GB M7 Pro.
I also wouldn't be surprised if, 3 years after that, cheaper hardware and improved model efficiency means that there's a much smaller gap between what you can run on a consumer CPU (which, with memory prices coming down, could look like a 256GB M9 or M10 Pro) and $100k GPU cluster.
We've been sat with basically the same PC specs for ~20 years - our current specs are within an order of magnitude of the ones we could buy back in 2010. This is not really constrained by tech, as we could have much, much, larger machines. It's more because there's no mass demand for much, much, larger machines - if it's big enough to run Office apps or VSCode then you're good to go. The exponential growth we saw in the 90's was driven as much by software demand as it was by hardware development.
I can see the next 10 years produce the same kind of push for larger machines that the 90's did. And we should probably expect the same kind of standards churn as our existing technologies for storage, memory, etc, don't scale up enough and new technologies become worth developing because there's demand for them.
My productivity profits from the best intelligence available, a decent context size, and a batch size of four.
While my MacBook has 48 GB of RAM, not only do I want the above requirements at a decent speed, but I also need my machine to run the development tools and test suites, ideally without the fans blasting at full load.
For the foreseeable future I will stay with providers rather than local inference, apart from niche use cases.
I'm in Australia, so we're probably not getting access to Fable again. We're learning that a faster model + better harness/framework > smarter model. So being able to run GLM5.2 locally and super-fast would be great.
But the existing tech we're using for 16GB probably isn't going to scale to 16TB at a reasonable price point. And the price point is relatively inelastic - people are used to paying <$5K for their computers, and they're not going to go much above that. You'll get early adopters paying $10K or more for a machine that large, but not the early majority. And even then, obviously, $10K is not going to buy you a 16TB memory machine.
So there's room for a new technology to come in, where there wasn't previously. This is what happened all through the 90's, and we churned through a bunch of standards and technologies to try and keep up with demand.
Are they?
I suspect AI labs are buying stuff not just for their own use, but to make local use too expensive to be an option :-( And they can always make the "best" frontier model even bigger (though only fractionally better) so it's always out of reach of local use, while consumer laptops have nearly the same amount of memory they had a decade ago.
[ASCII chart: model size vs. year (2020, 2022, 2024, 2026)]
[ASCII chart: cheap RAM vs. year (2020, 2022, 2024, 2026)]
Prices aren't going down, and consumer platforms are being shipped with less RAM so we can be sold cloud products. This isn't going to happen.
Can you please explain to me how you're going to fit 700B-1T params in 64GB of RAM? You realize there are memory requirements proportional to model size?
You don't. What they're saying is that today's small models (that fit on consumer hw) are better than yesteryear's top models. GPT4 was reportedly 8x 220B (~1.76T) MoE, and today you can run a 30-120B model that beats it handily in real-world tasks.
Similarly for 4-20B models beating GPT3 (175B) and so on.
There is a sweet spot of "good enough" that the small models can reach, where you get equivalent tasks solved fully locally. They'll never touch SotA, but they'll reach the SotA of 2-3-4 years ago. Which, depending on the task you need, can be "good enough".
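For the memory question a couple of comments up, a rough rule of thumb is weights ≈ parameters × bits per weight / 8. The sketch below is only that: it counts the weights alone and ignores KV cache, activations and runtime overhead, and the parameter counts and bit widths are illustrative picks from the ranges mentioned in this thread, not measurements of any specific model.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (10^9 bytes).

    Assumption: every parameter is stored at the same bit width.
    KV cache, activations and framework overhead are NOT included.
    """
    return params_billion * bits_per_weight / 8


# Illustrative sizes only: a 700B model at 16-bit and 4-bit,
# and two "small" models at 4-bit.
for params_b, bits in [(700, 16), (700, 4), (120, 4), (30, 4)]:
    gb = weight_memory_gb(params_b, bits)
    print(f"{params_b}B @ {bits}-bit: ~{gb:.0f} GB")
```

Under those assumptions a 700B model needs about 350 GB even at 4-bit, while a 30B model at 4-bit needs about 15 GB, which is why only the smaller class fits on a 64GB machine.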
Personally, I’m waiting for hardware to hit the secondary market before I buy something to run unquantized models like GLM. But I have no doubt that I will, at some point.
If that's anywhere near right then it seems like a no brainer.
The input cache hit tokens are incredibly cheap for them, (incredibly high margin too, except for deepseek).
And input tokens are in the middle. Input tokens can be processed very efficiently.
Also his math is wrong. $100k gets you 22.7B output tokens at $4.4/M which is how much GLM 5.2 costs.
At 500/s, 22.7B tokens last only about 526 days, or roughly 1.44 years. Which is much less than the life of the hardware.
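A quick way to check this kind of back-of-envelope math is to write it down. This is a minimal sketch using only the figures quoted in the thread ($100k budget, $4.4 per million output tokens, 10 sessions at 50 tokens/s); it assumes a flat output price and ignores input and cached-token charges, and the function names are made up for illustration.

```python
SECONDS_PER_DAY = 86_400


def tokens_for_budget(budget_usd: float, usd_per_million: float) -> float:
    """Output tokens a budget buys at a flat per-million-token price."""
    return budget_usd / usd_per_million * 1_000_000


def days_to_spend(tokens: float, tokens_per_second: float) -> float:
    """Days of continuous generation needed to use up `tokens`."""
    return tokens / tokens_per_second / SECONDS_PER_DAY


budget_usd = 100_000
usd_per_million = 4.4   # output price quoted in the thread
throughput = 10 * 50    # 10 concurrent sessions at 50 tokens/s each

tokens = tokens_for_budget(budget_usd, usd_per_million)
days = days_to_spend(tokens, throughput)
print(f"{tokens / 1e9:.1f}B tokens, {days:.0f} days, {days / 365:.2f} years")
# prints "22.7B tokens, 526 days, 1.44 years"
```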
concurrency
oil workers buy 100k trucks they do not-much with. why not a 100k in computer?
Some, and the market fluctuates a ton.
> corvettes
Only the oldest, most unique model years: nobody is buying (C4-C5-realistically C6) mid-90s or early 2000s Corvettes for more than what they paid for them, and they never will.
Both of those things' value drops like a rock as soon as you buy them and, at least for cars, they don't all appreciate. Most don't. Even so, they appreciate at an incredibly slow rate.
I can't speak for watches but I'd be surprised if it wasn't the same situation.
At least the gpus can create value after you buy them before they are worthless.
I assume (since they claim they are selling the batteries to AI data centers), they’ll produce some sort of EV >= F150 once the bubble pops, and we get a new president.
EV is a separate thing. Vastly overmarketed for the technology as it exists today.
Isn’t the performance gap between quantized and full models indicative that even if you aren’t using it directly, the model knowing the colors in the Russian flag does have something to do with the intelligence you demand?
The compression is almost certainly in part specific knowledge getting fuzzed.
Likewise, LLMs do not violate the laws of information theory, and therefore the only way to encode X amount of information in Y amount of bits where X > Y is by performing what is effectively lossy compression, and as X grows larger relative to Y the compression ratio must change to lose ever more information.
Yes, for the sake of making chatbots that are "conversational" in that they can interpret natural language as input and produce code as output you can easily benefit in incidental and unintuitive ways by training it on more natural language text. But for a given fixed parameter size, it's possible to produce a better model for a specific task by selectively not muddying its training set in the first place with things that are likely irrelevant to the task.
It's hardly self-evident, and your counter-example is hardly applicable.
The first 10^50 digits of pi are not the same as having BREADTH of information in the training data, which is the whole point, not just any random "information that is irrelevant to your use case".
not to mention that the first 10^50 digits of pi compress to quite small formula, so not much information there to begin with from a shannon/kolmogorov perspective
The memorization of say 100000 world facts through training texts, which enrich model associations all around, is absolutely not the same as rote memorization on 10^50 digits of pi. Not for a human, and even more so, not for an LLM.
An LLM trained with digits of pi and one trained with books and posts, even if they both have the exact same amount of bytes of training input, would not be comparable in any way in utility and reasoning capabilities.
>There's an infinite amount of information that we could shove into a model, and a finite amount of bits with which to store any of that information such that it can be usefully recalled or form useful logical associations.
Which is irrelevant. Anyway, the amount of information that doesn't form useful logical associations is even larger (e.g. actual human books vs possible permutations of characters and spaces). Just like those (random) possible permutations of characters aren't good for LLM input to get logical associations out of it, pi isn't either (logical associations of the kind we care for and expect, not of the kind related to pi's sequences).
Also it's not only not self-evident, it's also apparently wrong.
You're making the assumption that anything produced by a human necessarily contains more useful information than random noise does. This is false. Even when only considering human intelligence, it's entirely possible to absorb information that makes you stupider, not smarter; learning is only valuable if you actually learn the right things.
I'd say this exchange is a fine example of that :)
We don’t understand AI or natural intelligence well enough to make such statements. As for self evidence, cross-domain competence in humans and the rise of generalist models over domain-specific ones (on competence, not cost) seems to pretty directly tank your hypothesis.
If you believe this then you don't understand AI or natural intelligence well enough to refute my statements either.
Perhaps you're trying to refer to something specific by "cross-domain" competence, but firstly, humans vastly overestimate the extent to which experts in one domain can be trusted to speak accurately on topics in other domains (this is a form of authority bias), and secondly, real cross-domain expertise is a result of pre-existing metacognitive ability such as keen reasoning ability, intense focus, and learning-how-to-learn. In other words, Leonardo da Vinci was not a genius because he was a polymath; he was a polymath because he was a genius.
Likewise, I see no evidence that "generalist models" have proven anything about their ability over domain-specific ones other than that the big AI firms seem to believe that "generalist models" are their golden ticket to AGI and therefore a quintillion-dollar valuation. It's obvious in the long run that tools built for specialized tasks will outperform generalist tools for specific tasks, in the same way that a multi-axis CNC mill does not outperform your bog-standard lathe for shaping objects with rotational symmetry, or perhaps more pertinently to this conversation, how no LLM will ever outperform Stockfish at chess.
assuming demand doesn't keep on increasing. even google has trouble having enough capacity apparently.
Or is there a business model I’m missing?
There are many layers of Chinese govt. But GLM is backed by Beijing municipal govt and Tsinghua University.
As far as US EVs being subsidized early, if you take state and federal tax incentives, DoE grants and loan guarantees as subsidies then that's true.
It's debatable (I think incentives applied to all suppliers not just US ones) but a reasonable statement.
so Tesla technically is subsidized by US govt. SpaceX too. Without NASA funding, they'd be long out of business.
China and US ain't that different.
China realizes that being a tech and industrial powerhouse working on future tech is great for their economy. They bet huge on it. That's how they win.
Europe on the other hand is now a laggard.
In other words, they're not subsidies for Chinese cars being exported abroad. They're not even directly paid to the manufacturers.
GLM 5.2 is already capable enough to assist in self-training which is similar to what we saw happen with frontier models and they appear to be getting there at a significantly lower cost than openai/anthropic.
I expect future Chinese models to introduce even more of this type of bogus "safety" training.
Looks like if you are a white hat, then you will be fighting an uphill battle. Black hats will be fine, they will not care, they can just run a heretic model or specialty trained model.
Mythos level really doesn't seem that scary. And it would be a great way to take away the American labs international market.
I think it would make strategic sense for them to release more capable models than what American labs are allowed to make available to the world. It would help them grow their global soft-power and be a destabilizing effect on the American economy.
China could not be happier.
The same model is going to apply to the silicon supply chain as well is my guess. 1000th the expenditure in exchange for being a little behind the curve.
I worry it will have a very real chilling effect on research and development, since customers will probably very quickly switch to the thing that costs 1/10th as much, sucking out the ROI.
Care to give more context to this? Seems interesting
I have more vendor lock-in with Design than I do with Code, and will switch over as soon as Claude loses the smallest technical advantage
Not that it would make any sense.
Any prohibition on open source models will do nothing to fix the problem.. since attackers will never feel bound to the law. All advanced models must be available for defensive purposes.
If the real motive is profit, then open source models are likely simply not a viable means to that end.
If huggingface or whatever is forced to take down open source licensed weights, there’s always bittorrent.
Export controls are one thing, but the US doesn’t really have import controls, and there’s no copyright issue, so DMCA, etc don’t come into play.
It’d take the courts years to decide how to contort the law to ban open weight models, and by then, it’ll be too late (and also pointless).
But that's the whole point.
Fall out of favor with the admin and you lose access to the good American models, aren't allowed to use Chinese ones, and fall prey to the attackers and behind your competitors.
Which would be fine, but as we know, people securitize the crap out of their investments these days, and at least some people probably leveraged themselves on some US AI companies, so now the risk is spreading outside of the sector to the economy in general, which is made worse by the sheer amount of spending on AI.
I’ll happily pay a 100% tariff on open weight models, and there are no regulatory hurdles for them to jump through (yet).
Yes, you get your free model, but the cost of this is not developing your own capability and tying your fate to a country which may or may not have your best interests as a nation in mind.
This is just the deindustrialization that occurred in my home region (the American Midwest) playing out on a global scale in different sectors. It was originally driven by the Japanese, who, to their credit, acted more as partners than competition. Eventually that desire for larger margins went to China, and now you basically can't build anything of consequence without at least some Chinese parts, because there's "no economic case" for it. This means that you have to play Beijing's game if you want access to any sort of modern market.
You see this happening with Volkswagen's restructuring, next you'll see it with non-American, non-Chinese AI.
Over the last decade, the US has been way more unreliable than China. There's been a near constant negative impact from the US doing something.
At least with China, we are very good at winning trade wars with them here in Australia.
Or, on a more local note, an Australian automotive worker who worked for a company that figured out 10 years ago that they wouldn't be able to pay him a decent wage, compete with the then-upcoming Chinese EVs, and remain profitable.
There are no good guys in general, and whataboutism and making the scope bigger doesn't help.
The thing is that if the models you are building on are open source, whether hosted on a Chinese / American / whatever service, that at least gives you the option to switch providers more easily, vs a Fable / ChatGPT 5.6 that gets banned for non-Americans etc...
2 years ago america would have had the branding/perception advantage but right now that is well and truly gone...
Stop pretending there’s some type of moral high ground there isn’t. Disgusting.
man you're gonna be disappointed when you learn where the components for Ukrainian drones come from (spoiler alert, it's China: 95% of Ukrainian drone manufacturers use Chinese components. Both Ukrainian and Russian drones are Chinese components glued together, the vendors in China literally stagger Russian and Ukrainian buyers on the factory floors to not have them run into each other). The largest trade partner of Vietnam and the Philippines is China.
The kind of thinking that assumes that rivalry implies deglobalization or bloc politics is exactly what's 30 years out of date. It's projecting how Americans think on the entire world, but that's not how the world works any more. The rest of the world continues to globalize, even through war.
America is undergoing Sovietization and erecting an Iron Curtain, and China ironically enough is simply doing what the US used to do. If Americans think the rest of the world will follow them into isolation they're going to make the same discovery the Russians did in the last century.
I don't understand what your point is? This seems like a perfect example of comparative advantage - Australia can produce iron ore cheaper than anywhere else in the world and even when China launched a trade war against Australia the Australian economy kept growing.
There wasn't even any bump in unemployment from the closing of the car industry.
Once that trade war was settled, Australia got cheaper cars, China got cheaper iron ore and both economies won.
The rational behavior on both parts there is in stark contrast to current US policy, which is unpredictable and capricious.
> You might feel differently if you were a Filipino or Vietnamese fisherman whose family relied on the income from the stocks of the South China Sea, or a Uighur person living in Western China, or a Ukrainian soldier who has to deal with drones built with Chinese components, or a democracy advocate in Hong Kong, or arguably, a person who had plans for 2020-2021.
This seems like a random list of complaints about China and I agree with them in general.
I think you'll find most major powers have similar complaints. There certainly are against the US - I think you might find that both the Philippines and Vietnam(!) have fairly mixed feelings about the US for example.
I’m sceptical they could find the legal framework to do this even if they wanted to
They have legal authority to (a) prevent export of US goods/services; (b) ban imports of physical goods; (c) ban transactions (including purchasing services or license agreements) with foreign firms
But I’m not aware of any legal authority which lets them ban US firms from running a Chinese-developed open source AI model in the United States, if they are at arms length from the vendor, and aren’t using it for government contracts or regulated applications
Possibly they could order HuggingFace/etc to suspend Chinese accounts. But if someone in the US (or a third country) downloads the model from China then reuploads it to a US server, completely independently of the vendor - where is the legal hook to prohibit that?
This would be extremely heavy handed and probably end up accelerating the loss of the virtual US monopoly of payment networks. The rest of the world isn't going to let the US dictate that only they get the frontier models, whether they're US made or otherwise
Can they actually though? Do they have legal authority to tell a payment processor that it has to block transactions of a legal US company, just because the company is hosting a Chinese-developed open source model? I’m sceptical
And what about companies (e.g. AWS) that let you “bring your own model”?
That's sanctioning specific individuals for specific acts they performed which the US claims contravene its interests and those of its allies.
I don't agree with the ICC sanctions, but it really can't be compared with the proposal "sanction any company, even US domestic entities, which use a Chinese-developed open source model".
In fact, I think part of what enables the US to sanction them (under US law) is the fact they are neither US citizens nor residents; if they were US citizens living in the United States, I don't think the President would have the legal authority to impose those kinds of sanctions.
They could sanction Hetzner–because it is a German firm based in Germany. I don't see how they could sanction a US firm based in the US whose owners and staff were US citizens.
Also, the 5th Circuit Court of Appeal decision Van Loon v Treasury (Nov 2024) is relevant–it held that IEEPA (the law used to sanction ICC officials) couldn't be used to sanction the Tornado Cash smart contract system, since open source code wasn't "foreign property" under IEEPA.
I agree, my only caveat is that the current administration has shown it's willing to go beyond aggressive regulatory interpretations to questionable and outright implausible interpretations. As we've seen recently, the federal courts and SCOTUS are overturning most of these but that can take a year or more to resolve. The one positive light is they seem to push the hardest on certain culture war issues (immigration, voting, districting, etc). AI doesn't seem like a core hot button issue for the White House and there is a strong pro-AI / business faction.
US imposing export restrictions on a model from China?
The weights are already available and downloaded, is it going to be a crime to have them, run them, make them available? Constitutional rights still exist (I hope)
Now you're getting it! Commerce will call it a munition and those harboring it as harboring illegal/foreign munitions.
No business will take the hit, so they will quickly deplatform the models.
No end user has the GPU capacity to use GLM 5.2 or similar models at full precision so the government will call the problem "mostly solved." But they might choose to "make examples" out of a few people using p2p software to download the weights if they choose to.
I'm for making software better instead of banning it based on what the rich and powerful claim.
I suspect the real fear is that open weight models undermine the financials and token prices they thought were going to pay off their ludicrous spending because they have all raced and raised hardware prices.
We're still in the middle of the cambrian explosion.
If Anthropic was capable of developing Opus 4.49-4.5 2H 2025.... then any company with a research team capable of reading all the papers and press releases will be capable of producing Opus 4.8 by the end of 2027, either raw model competency, or in a harness like claude code (or better with both). I guess what I am trying to say is that Opus 4.5 does not represent the edge of agentic capability, merely somewhere in the thick meaty layer of "functional and achievable".
We can draw the line at Sonnet 4.6 in the US but much like encryption export restrictions in the 1980s, the line drawn will be laughably low within a few years and simply unthinkable in a decade.
That would be the rational thing to do.
> financials and token prices
I do not think the government thinks this deeply. Market manipulation might be a rational, if unethical reason to ban open source models.
But this admin banned Anthropic models to "own the libs." They will continue to ban what they want for whatever reason they want. I don't think those reasons will be particularly coherent.
Yeah. Illegal numbers.
If you classify AI as a weapon, which seems to be the direction that we are all heading towards, then yes, First Amendment rights likely won't apply.
The reason GLM-5.2 hasn't been banned is that despite these cherry picked use cases, GLM-5.2 isn't even close to Opus in all use cases. These vibe benchmarks are run by companies that are not part of the cyber services offered by Anthropic and OpenAI where they can use the models without the safeguards and refusals so their actual cyber capabilities can be utilized.
These guys that wrote the article compared a gimped Opus to GLM-5.2, knew full well it's misleading, and got the clicks regardless. They don't have enough clout to be a part of something like Project Glasswing, GPT Cyber, etc.
The others are a waste of taxpayer money. Extraordinarily low return on investment (kill-on-investment?)
Claude Code is an agent harness, not an LLM.
Claude is a brand (or group of LLMs), not an LLM.
Opus 4.8/4.7 scored 28%
Opus 4.6 scored 37%
So the author thought, let's not get into that, just write Claude.
wild guess - I wouldn't be surprised if Opus 4.6 was run quantized for a while, and 4.7/4.8 have QAT for that nerfed size.
Secondly these are "just" IDORs, arguably the easiest class of vulnerabilities.
Thirdly it compares to GPT 5.5 and Opus 4.8.
No, we don't have Mythos at home.
mythos is <10% ahead of gpt 5.5 on all benchmarks, which it gains by being several times the size of opus. had it been economical to provide, it would've been released to the public on day one instead of the marketing circus those effective altruism clowns had exhibited. admitting that it costs >1000% to run inference on a <10% better model would've been very damning.
do you have a source for this claim? i thought LLM providers earn high margins from inference (charged by token). is this no longer the case?
no one has a source, because no one knows closed model parameter counts. we have only heuristics which strongly indicate that Mythos is simply a big fucking model that any other lab could make an equivalent of.
The only ones who seem to profit are the ones running smaller Chinese models. Even NVIDIA seems to have to "reinvest" their profits into sponsoring companies to buy their cards now.
> No, we don't have Mythos at home.
That's still useful. To paraphrase the kids these days, GLM5.2 is in the room with us, today. Mythos is not. And for us in the EU, it's even more complicated, as Mythos might be with us in the room one day, and go poof the next day, on the whims of political entities that we have 0 control over.
Knowing where open, accessible, local models are is important. We know they're behind. But there comes a time when "good enough" is useful. Even if they're "just IDORs" today, and even if they're behind SotA today.
As someone else said above, GLM5.2 (and other models in the same tier like kimi, dsv4, etc) is / are slowly becoming "good enough" to assist in automated repo preparation work (download, install, test, edit, re-test, etc). And that translates into RL traces ready to be trained into the next generations. That might be more important than x% behind on benchmarks.
GPT does way worse than Opus without their harness, but better with it.
Opus 4.7 and 4.8 do way worse than 4.6. (Intentional nerfing?)
Would have been interesting to see GLM in the custom harness.
Would also be interesting to run GLM in Claude Code, which it has presumably been fine tuned on.
What does that mean for the frontier?
https://lifeinthesingularity.com/p/glm-52-proves-ai-comes-fo...
Where's the cost per vulnerability for all the other models than GLM?
Also, without code this isn't very trustworthy. Could all be made up as well.
[0]: https://aibenchy.com/compare/anthropic-claude-opus-4-8-mediu...
It depends a lot on the task and harness too (using plans and to-do lists, vs one-shot answers), but for simply answering directly to an inquiry, often extra thinking doesn't necessarily improve the answer, especially if the answer is binary, or can be correct or wrong, as opposed to having more time to refine a creative output.
https://aibenchy.com/compare/google-gemini-3-1-flash-lite-hi...
So one can see businesses owning their own such cluster, next to their database infra, in the near future.
Beats which model in Claude? Whenever a "benchmark" doesn't put precise model numbers in their headlines I am immediately skeptical. Either they don't know the difference (bad) or they are benchmarking against weaker models (misleading, also bad).
It's like when studies say "AI is bad at X" and they used GPT-3.5 in current year.
How are we supposed to stay skeptical of everything if we read anything!?
GLM may be a good model in general but it's benchmaxxed and definitely not as good as Opus 4.8.
I use DeepSeek V4 Flash (high) and MiMo 2.5 (non Pro, because vision) to work on medium sized projects (~1mil lines of code, C#, Go, TypeScript) with great success.
And that is coming from someone who used Opus 4.7 and GPT 5.5 as workhorses before.
And I'm pretty sure GLM 5.2 is better than the lighter models I use.
My workflow is simple: plan -> clarify -> implement.
1) plan prompt template: I describe what I need and ask LLM to generate a markdown file containing an implementation plan plus at least 10 clarification questions for me to answer.
2) I answer the questions in the plan.md file.
3) implementation prompt template: I ask LLM to implement plan.md and tell me at the end if there were any deviations and new findings during the implementation (there often are).
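The three steps above can be sketched as two prompt builders with a manual step in between. This is only an illustration of the shape of the workflow: the wording of the prompts, the function names and the example task are all hypothetical, and no particular model or harness API is assumed (the strings would be pasted into, or sent by, whatever agent you use).

```python
PLAN_FILE = "plan.md"  # file name taken from the steps above


def plan_prompt(task: str, min_questions: int = 10) -> str:
    """Step 1: ask for a markdown plan plus clarification questions."""
    return (
        f"Task: {task}\n"
        f"Write an implementation plan to {PLAN_FILE} in markdown.\n"
        f"End the file with at least {min_questions} clarification "
        "questions for me to answer before any code is written."
    )


def implement_prompt() -> str:
    """Step 3: implement the answered plan and report what changed."""
    return (
        f"Implement {PLAN_FILE}, using my answers to the questions in it.\n"
        "When you finish, list any deviations from the plan and any new "
        "findings made during the implementation."
    )


if __name__ == "__main__":
    # Hypothetical task, for illustration only.
    print(plan_prompt("Add rate limiting to the public API"))
    # Step 2 happens by hand: answer the questions inside plan.md.
    print(implement_prompt())
```

Keeping step 2 manual is the point of the design: the model has to surface its open questions before it writes code, and the answers live in the same file it implements from.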
People will use the model with the harness. I know that harness may not be optimized to this model, but it's still more useful to see the numbers from an imperfect harness than from a no harness setup.
What were they? Also, wouldn't one expect a more recently released coding agent (with a more recent knowledge cut off) to perform better because they have access to more knowledge about vulns in these OSS projects, and even possibly have knowledge of your own "prior research"?
But I would like to point out that the overwhelming majority of people using LLMs aren’t programmers, don’t care about coding, and couldn’t even be bothered to “vibe code”.
So we should consider the bias of the output of these open weight models, and what that looks like, outside of the context of writing code.
I don’t agree with “Software development is where money is made for these labs”. Coders will inevitably eat up the most tokens & buy the bigger $200 subscriptions because we want to keep working.
But us coders are still the small minority of users. They aren’t counting on us to get to trillion dollar valuations.
They are counting on the regular folks to buy the $20/ month subscription. It’s really easy to run out your free tier usage these days, asking questions that have nothing to do with coding.
So my point is what does that output look like for someone asking a question about politics or world news?
I know Google gives me free Gemini AI from my Google Drive plan. Microsoft probably already does too, didn't test. Apple is probably crafting some arrangements if not offering already.
My point is most people won't pay for AI. It will be bundled.
And I think AI is going to be free for all, with ads.
I wouldn’t call that “no money”
Oddly, this is a strong indication of the text being hand-written rather than LLM-assisted; it's very likely that a human made a mistake in creating the table.
> ... beating Claude Code (32%) ...
> ... GLM 5.2 ... beat Claude Code by seven points (39% vs. 32%).
> Rank | Configuration | Harness | F1
> ...
> 4 | Claude Code (Opus 4.6) | Claude Code SDK | 37%
"our IDOR benchmark", there you go.
Which I guess makes what semgrep sells obsolete. Unless they have built a pareto-optimal point in terms of capabilities and token usage maybe?
GPT gets there in <5 minutes, GLM 5.2 without context takes ~1h.
Though the harness makes a significant difference. On Pi, GLM 5.2 dreams for minutes; with OpenCode it's more to the point and gets to editing quicker.
I think the post is still informative, but a little disingenuous and clickbaity.
Now I feel like I'm covered by GLM 5.2 and Minimax M3 (when I need vision or a second pass on something).
It just goes off getting confused about how to design the map for 15 minutes and then times out.
Having used GLM 5.2 for non-security software work, I can say it's better than Sonnet (but not Opus), and cheaper than both (because when you steal someone else's IP, you don't have to amortize the cost of their R&D).
Or grabbing their GLM Coding Plan directly: https://z.ai/subscribe
I went with the second one to try it out, feels pretty okay (with OpenCode, though Claude Code would also work), however it feels like I reach the weekly limits somewhat fast with their 65 USD Pro subscription. They also have that whole peak times thing going on and apparently it will get worse after September:
> Supported models and Visual Understanding MCP share the same usage quota. GLM-5.2 and GLM-5-Turbo consume quota at 3x during peak hours and 2x during off-peak hours. Limited-time benefit: off-peak usage is currently charged at only 1x quota through the end of September. Peak hours: 14:00–18:00 daily (UTC+8).
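The quoted rule can be sketched as a small function. This is only an illustration of the quote, not z.ai's actual billing logic: the function name and the `promo_active` flag are hypothetical, and treating 14:00 as inclusive and 18:00 as exclusive is an assumption. The promotion is passed in as a flag because the quote gives "end of September" without a year.

```python
from datetime import datetime, timedelta, timezone

# Peak hours in the quote are given in UTC+8.
UTC8 = timezone(timedelta(hours=8))


def quota_multiplier(when: datetime, promo_active: bool = True) -> int:
    """Return the quota multiplier for a timezone-aware timestamp.

    Assumption: the peak window is 14:00 <= t < 18:00 in UTC+8.
    """
    local = when.astimezone(UTC8)
    if 14 <= local.hour < 18:
        return 3  # peak hours: 3x
    # Off-peak: 2x normally, 1x while the limited-time benefit applies.
    return 1 if promo_active else 2


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(f"multiplier right now: {quota_multiplier(now)}x")
```

For anyone in Europe, the peak window falls in the morning (roughly 06:00 to 10:00 UTC), so most of a working day there lands in off-peak.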
Any good resources about this (also for setup and recommended config)?
After installing, do a `n8 build` to build the image, then `n8 --danger --provider opencode interactive` to launch it in a container.
Sign up for GLM-5.2 here: https://z.ai
When I posted the comment I was both the first commenter as well as the first person to upvote the submission. That matters. My name is ALSO on the open source repo that allows OpenCode to be run in a container.
That's transparency, maybe not here, but on a clickthrough to GitHub it is immediately obvious.
Not sure a project nobody knows or uses is much better in this regard?
I think they give $5 trial credits to test with any of the open weight models.
Instead of shilling for the LLM providers.
You need to read the market. Linus didn't read it in the 90s, Gates did and that's why Windows is in almost every home.
The only niche where it doesn't utterly dwarf the competition is personal computers, and it looks like we're all getting priced out of that anyway.
I'd be mostly fine switching to it.
I just can't find a cost-effective way to do that. z.ai's coding plan is both overpriced and unreliable. Ollama's is also overpriced. Paying by the token for it on OpenRouter etc. is more expensive than just having a Codex or Claude coding plan.
If you have to pay by the token, it's clearly cheaper. It's not competitive with a coding plan though.
Because you would have to switch model.
You can't just say "Oh, button X looks weird see [screenshot]" while coding with GLM. You would need to switch to another model and then maybe back.
This article only talks about detecting vulnerabilities, so it's unclear if it's a true Mythos equivalent.
This is where we are heading, and it is why many closed labs are terrified for their bottom line and want open-weight models banned from being released.
99.99% of people's day jobs aren't competing for the Fields Medal or even finding security vulnerabilities. So it appears that while the TAM (total addressable market) of AI in general is huge, the TAM for frontier LLMs is tiny. Efficiency gains at roughly the same performance might be all people care about from now on.
The incentive to develop these Chinese models further is to trash the business case of most American AI labs.
additionally, reliable API, because z.ai can be finicky
also, not for Enterprise use, but I like non-US providers, I don't care if the party happens to be the one reading my information and stealing my trade secrets, if they won't respond to a US subpoena
What explains it?
Is TFA lying? Is the most upvoted comment here lying?
The article itself doesn't say "it's better", it basically just says "in this one specific benchmark it beat Claude with Claude Code". Mind you, with multimodality Opus still beat GLM 5.2 very handily in that same benchmark.
I can't find any contradiction and I don't see anyone lying directly. At most they lead you to infer false things, but they're not untrue on a literal reading.
There are a number of US providers who also run it, if that is your preference.