GLM-5.2 is a step change for open agents

upvote

GLM-5.2 is a step change for open agents

(www.interconnects.ai)

347 points

by vantareed2 days ago |

upvote

by jerojero2 days ago|

[-]

Open weight models from Chinese labs tend to be significantly cheaper.

I think theyre absolutely needed. I can't afford 200 USD a month for personal use of coding AI, and I don't think such prices are reasonable for most of the world economy anyway. Not to mention US firms might be giving their employees a lot more than that.

It's increasingly feeling, to me, that theres a gap building up between haves and have nots. But then, we get news of these open weight models that are reasonably priced in inference with reasonable capabilities. Yes, they take maybe 6-9 months to get there, tbh, that's not a bad trade off at all.

reply

upvote

by fbrncci17 hours ago|

[-]

You made me realize something. I routinely spend upwards of 500$ per month on LLMs for coding (expensed towards clients). However I live in a place where 500$ is around the avg. salary. I’m lucky that I know my way around western clients. Clients who pay these expenses and are happy to work with me because I am still about 50% cheaper than local talent in EU/US, while my salary at home converts to an upper class income at the highest tax bracket.

Which of course causes some unfairness on both ends. Nobody here can compete with me. I often use left over tokens on local client projects; which despite lower pay, still pays off because they now take hours not days or weeks to complete. And nobody in the local clients talent pool can compete with me; unless they charge about half the market rate.

Take away my 500$ monthly grant; and I’d be more or less screwed. Better open models will more or less start to reduce this advantage. It’s not like I positioned myself here on purpose. But it’s definitely a „right place, right time“ situation.

reply

upvote

by whazor6 hours ago|

[-]

The problem is that the differences between flagship and local models are compounding heavily. An 4% different could be massive when you keep iterating on the same code base.

reply

upvote

by swiftcoder4 hours ago|

[-]

> The problem is that the differences between flagship and local models are compounding heavily

This depends a lot on how you work, and how much of the architectural thinking you do yourself.

People seem to lose sight of the fact that a flash model today is as powerful as a frontier model from a year ago. If you were happy with GPT 4.x, you should be ecstatic that equivalent power is now basically free...

reply

upvote

by wolttam2 hours ago|

[-]

I am one of those ecstatic folk :)

reply

upvote

by listic15 hours ago|

[-]

Thanks for sharing your insight.

Mind if I ask you for a few vibe coding tips? I failed to solve you gh puzzle in the profile though.

reply

upvote

by swader99917 hours ago|

[-]

If you are running multiple agents your cost to them should be multiples less what their roi is.

reply

upvote

by fbrncci17 hours ago|

[-]

My costs are 0$ as any token or subscription spend on agents is invoiced as an expense to my clients.

reply

upvote

by kreelman15 hours ago|

[-]

Thanks so much for being bold enough to be fairly open about the costs, how you arrange billing and the advantages that's given you.

I've been fooling around with DeepSeek 4 agentically. It's probably not as good as Anthropic offerings, but even those seem to be roiled in politics and strife and DeepSeek 4 is very good IMHO. I'll later try out GLM.

I'm in Australia. The government has set up a "return and earn" scheme to keep aluminium cans, plastic bottles and paper drink cartons out of the waste stream. A laudable project. The money you make from return drink containers is pretty low, $AU 0.1 per container. I've participated to get the rubbish out of natural water streams and to make a nano amount of money on the side.

When I looked at the costs of an app I was getting DeepSeek to help me with, I realised that the several hours I'd spent learning and building had cost something like 8 recycled containers. In my head after doing some DeepSeek stuff, I calculate a "cans per app" metric for myself for fun. I may even setup a simple graph to view my costs that way.

I kind of hope the Anthropics of the world get enough price competition from sources like DeepSeek and GLM to drop their prices significantly. Time will tell.

I'm using the Chinese DeepSeek provider, so everything done there could potentially be taken and used by the CCP... But this is hobbyist learning.

There is probably a market for Deepseek/GLM served from non CCP available servers. I might even look into how hard that would be to setup here.

I also hope that inference focused hardware will come to the fore, reducing energy use and cost. Realistically this will take time though, on the order of years.

Here in Oz, we have community batteries that community members can charge and later draw from. Their electricity prices are competitive. I wonder if someone could setup something like a community battery to run data centres... That way reasonable environmental consideration could be given to inference power generation... This might not work in a market like the US or Europe, but small market size might be an advantage... Who knows.

reply

upvote

by SyneRyder8 hours ago|

[-]

> There is probably a market for Deepseek/GLM served from non CCP available servers. I might even look into how hard that would be to setup here.

Please do. There is definitely a market for Deepseek / GLM hosted from non-China servers, there's over 20 providers for GLM 5.2 on OpenRouter alone... and they're all either Singapore (home of Z.AI / GLM), China, or US. There is nothing yet listed on OpenRouter from Europe (Inceptron still only has GLM 5.1). And of course, there is absolutely nothing hosted in Australia.

We're in a particularly dire situation in Australia. We're about to be cut off from Claude Fable and premium American models. The European Mistral models are garbage, at least in comparison to US models. Our only hope is going to be Chinese models (GLM 5.2 is good), and we're not even hosting them in Australia.

By the way, if you haven't tried an Anthropic model, it's worth spending at least $20 one month to give Opus 4.8 a try. I only got one night of access to Fable before I was cut off, but one single evening of Fable provided plans that I've been working through for about a week afterwards with Opus 4.8... and that was only Fable, not even Mythos. That's the kind of intelligence lead Australia is about to be cut off from.

(And kudos on the Containers For Change, that's something I do as well - mostly as an exercise incentive to walk to the local recycling machine, because the money certainly doesn't compensate for the time spent on the recycling.)

reply

upvote

by Mossy95 hours ago|

[-]

Cortecs (EU router) lists GLM 5.2 from Tensorix and Nebius https://cortecs.ai/detailedServerlessView/glm-5.2

So two European providers at least

reply

upvote

by trollbridge5 hours ago|

[-]

Hosting in Australia is not feasible at Australian electricity prices.

(Speaking as a not-so-proud Australian.)

reply

upvote

by Sanzig6 hours ago|

[-]

Same issue in Canada - domestic inference capability for the open models is woefully behind.

reply

upvote

by trollbridge5 hours ago|

[-]

Canada has fewer excuses, given sparsely populated places that are cold with nearly infinite water and extremely cheap electricity.

reply

upvote

by Sanzig4 hours ago|

[-]

Yep, agreed. Main issue in Canada is a notoriously slow and stingy investment ecosystem. Resource-wise we're incredibly well positioned.

reply

upvote

by forshaper4 hours ago|

[-]

Would you happen to know why there are so many Canadian investments in American telecom?

reply

upvote

by usef-9 hours ago|

[-]

It's very easy to use other providers. See https://openrouter.ai/ which also lets you filter by where the provider is hosted and their data retention policy.

Jeremy Howard was recommending fireworks.ai as a host of you want to go direct. Or there's Cloudflare.

For subscription alternatives people here on HN seem to mention Open Code Go a lot too https://opencode.ai/go

reply

upvote

by esperent12 hours ago|

[-]

> I'm using the Chinese DeepSeek provider, so everything done there could potentially be taken and used by the CCP

As opposed to Anthropic or OpenAI where everything done could potentially be taken and used by the US government.

Also, replace "could potentially" with "will definitely" in both cases, there's no conspiracy here.

We're stuck between two bad positions, so just use the one that's best for you, and wait for a better solution to arrive.

reply

upvote

by dudisubekti6 hours ago|

[-]

You don't seem to like the "CCP" and their political views, but why are you using their sponsored models?

Why don't you exclusively host and use the open-weight western models, even if right now they don't perform as well?

I'd like to know the psychology behind this, because your actions feel contradictory to me.

reply

upvote

by lanthissa9 hours ago|

[-]

AI is the first technology that doesn't incentivize offshoring, and incentivizes co-location of talent.

A NYC dev and a dev in india have the same ai costs, based the ratio tokens/salary it becomes less of comparative disadvantage to be in NYC.

Now combine that with the fact that AI makes the act of generating code less a % time of the job, and the ability to get/refine requirements more of the job and you have a decent shift.

reply

upvote

by Sammi7 hours ago|

[-]

Errr you just responded to someone that is offshore and is using AI to be much cheaper than local talent.

reply

upvote

by Fr0styMatt8818 hours ago|

[-]

If we can agree that the AI model is at least as capable as a junior engineer or new contractor, how’s that different to saying “software engineering isn’t worth $200 a month”?

Has a very race-to-the-bottom feel to it.

Though in the grand scheme of it, $200/mo probably isn’t the real price either. Also looking at it not just in a vacuum - paying for a product that can change what you get from under you doesn’t seem great anyway.

At least with a locally-hosted model you know what you’re getting.

reply

upvote

by matheusmoreira18 hours ago|

[-]

Yeah. There's no way to verify what these providers are doing. The real future is running these models at home. Opus level inference on our own hardware would be a dream come true.

reply

upvote

by baq6 hours ago|

[-]

I dream of having an LLM in a box over usb bought off AliExpress for a year and change now.

The LLM in a box is something you can buy today, but it 1. doesn’t serve over usb by default 2. costs $100k for hardware (not counting electricity) at 100 tps 3. can’t buy this from AliExpress.

Better to put that $100k in t-bills and just buy tokens even at api prices.

reply

upvote

by rescbr2 hours ago|

[-]

I understand your point (and definitely want the same), but I do have an almost-AliExpress-LLM-in-a-box: it's an Thunderbolt eGPU dock (that I got from AliE, and it is USB-C...) with a RTX 4060 Ti with 16 GB of VRAM (bought locally for gaming before the price boom)

It's been awesome for embeddings and document OCR!

3D printing a case for it is on my todo list.

reply

upvote

by IncreasePosts17 hours ago|

[-]

How will anyone running home instances be able to compete against people paying some money running much more powerful models on much more powerful hardware?

reply

upvote

by Fr0styMatt8816 hours ago|

[-]

It’ll be interesting.

I’m using Qwen3.6:27B at home and mostly Sonnet/Opus (depending on the complexity of the task) at work.

You have to break things down into smaller chunks for the local models. For the bigger cloud ones they can do a lot of the broader thinking.

reply

upvote

by fragmede10 hours ago|

[-]

Time is money, but apparently now thinking is money as well. How much is it going to cost to think harder? If it's, say, $10 to use a bigger cloud model, it becomes easier to qualify the cost of thinking.

reply

upvote

by Bombthecat8 hours ago|

[-]

Yeah. There always will be a gab. And it will keep growing for the next years...

reply

upvote

by jimbokun14 hours ago|

[-]

At some point it will be hard for us to tell the difference.

reply

upvote

by RazorBucksICO6 hours ago|

[-]

The appropriate price is what the output is worth to you. Some people could pay $10,000/month, some $5 and feel like they were breaking even. There is a big jump between convenience and curiosity uses versus business critical.

OpenAI already charges enterprise users a premium purely for that title over on-demand, no-contract usage. Retail users get a good deal. People make a lot of hay about subsidies but this is a very sane approach if you want exposure to these three different types of customers.

reply

upvote

by tacomagick1 days ago|

[-]

DeepSeek through their own API has saved me tons of tokens honestly. Even though it is not as smart as Kimi or Claude, their level of entry is very low with a top up of 2$ and Pay as you go compared to the subscription of Claude or 20$ top up of Kimi

reply

upvote

by praveer1320 hours ago|

[-]

For personal use I’m considering using the frontier models from openai or anthropic to create a plan with research and brainstorming etc with enough details for cheap models to be able to follow (glm, deepseek etc) - with openrouter - will monitor how cheap and effective that turns out to be.

reply

upvote

by ImaCake18 hours ago|

[-]

You should try out the cheaper models first. I find Deepseek v4 models pretty comparable to sonnet 4.6 but at a fraction of the cost. You might find you just don't need to use the American models at all.

reply

upvote

by lionkor11 hours ago|

[-]

Seconding the recommendation to use Deepseek directly via the API. I've burnt 287 million tokens in the last couple of days, costing me a whopping $5.77 USD.

reply

upvote

by tacomagick12 hours ago|

[-]

For my case Openrouter breaks Deepseek caching and charges me multiple times over what I pay for Deepseek's API, with 2$ I was able to get around 120M tokens from deepseek easily when Openrouter could only barely do 250k

reply

upvote

by jabroni_salad4 hours ago|

[-]

deepseek's direct API is super loosey goosey about caching. On multiple occasions I have gotten cache hits resuming a session from the previous day.

reply

upvote

by mdjxnxnxnd9 hours ago|

[-]

I call this the reviewer/implementer pattern.. Opus for planning then ds4/qwen/kimi for.implementation then opus for PR review

reply

upvote

by arikrahman18 hours ago|

[-]

Someone else on this forum put it well, U.S. is trying to achieve AGI at all costs, while Chinese models are seeking widespread adoption.

reply

upvote

by rglullis9 hours ago|

[-]

> U.S. is trying to achieve AGI at all costs

If that was true, they would be collaborating with each other and opening up all the results from their work.

reply

upvote

by lionkor11 hours ago|

[-]

None of the AI companies in the US are on the path to AGI. They are, however, on the path to claiming they have AGI, then subsequently not releasing it and only giving it to the US government to make drones that can bomb the homes of political dissidents.

reply

upvote

by dotancohen10 hours ago|

[-]

What kind of off topic political ideology spam is this? Do you not think that the Chinese kill their enemies?

The Chinese are genociding Uyghurs as we speak, purely for being Muslim, in numbers that dwarf any harm the US has done.

reply

upvote

by lionkor8 hours ago|

[-]

> in numbers that dwarf any harm the US has done.

The list of wars the US is or was actively involved in[0] is SO LONG that the Wikipedia page is split into multiple different pages.

The main relevant ones are 20th[1] and 21st century[2], for which you better get a good grip on your mouse to scroll down.

I urge you to use your favorite AI to give you a rough summary of direct and indirect casualties of just those wars directly caused, started, or provoked by the US, from these lists.

For example, the "war on terror" alone has, so far, seen around 4.5–4.6 million+ people killed, and at least 38 million people displaced.

[0]: https://en.wikipedia.org/wiki/Lists_of_wars_involving_the_Un...

[1]: https://en.wikipedia.org/wiki/List_of_wars_involving_the_Uni...

[2]: https://en.wikipedia.org/wiki/List_of_wars_involving_the_Uni...

reply

upvote

by metobehonest2 hours ago|

[-]

The US funds Israel and it is only those funds and military aid that keep it from collapsing unto itself. That's the state that orchestrates the largest scale genocide by a "first world" power since WW2, as recognized by the United Nations and independent organizations like the Amnesty International.

https://amnesty.ca/wp-content/uploads/2024/12/Amnesty-Intern...

Nothing China did comes close to this.

reply

upvote

by andriy_koval52 minutes ago|

[-]

> as recognized by the United Nations

its not, this would require voted resolution to declare genocide. It was some report on inquiry by individuals with unknown bias.

reply

upvote

by azinman218 hours ago|

[-]

I don't think anthropic/openai/google aren't also seeing widespread adoption. In fact they already have they already have the marketshare.

reply

upvote

by Turskarama13 hours ago|

[-]

The difference is that the US companies are using it as a means to an end, they need to make just enough profit that the investors don't all get cold feet before they get to AGI. The Chinese companies on the other hand are trying to be profitable immediately, which means that they're going slower to save development costs.

reply

upvote

by tsss10 hours ago|

[-]

Everyone wants widespread adoption, of course. I'm sure that China is also working on more expensive frontier intelligence models behind doors, but they're lagging behind America on that front. Going for cost-optimized open weight models is their bet to stay relevant in a market where they can't compete for the "luxury" segment. It is important for them to get a foot in the door and maintain a presence in the press to attract future customers, given the general animosity towards China in the west that they need to overcome. Similarly, European providers like Mistral are hopelessly outclassed in every respect and thus try to carve out a niche in the market with regulation and anti-American fearmongering. They position themselves as "privacy-conscious" not out of goodwill but because it is their only chance to survive as a company with an utterly inferior product.

reply

upvote

by giancarlostoro5 hours ago|

[-]

As much as I don't like Mark Zuckerberg, part of me wishes he would get his head in the game and compete with these models, he's literally got all the capability to do so, and he could easily sell the model through deals with GCP, AWS, and Azure. Hell, Amazon needs a hot model they can host that's exclusive to them I feel like, maybe he can work something out with them, whatever the case, it seems so glaringly obvious to me, I'm not sure why he hasn't taken a stab at competing with Claude Code or at least frontier open models and then cutting a deal with cloud providers to recoup the costs of maintaining said models.

He's sitting on a frontier model letting it burn a hole in his wallet that could actually pay for itself.

reply

upvote

by khurs5 hours ago|

[-]

Meta internally have been using Google Gemini

"Meta has been using Google’s Gemini large language model for most of its moderation and customer support, but staff have recently been told to switch to Meta’s new foundational model, Muse Spark, the people said."

https://www.ft.com/content/39251a31-4a9d-4870-b86c-dc6353d67...

reply

upvote

by giancarlostoro4 hours ago|

[-]

It feels really insane to me that they have a model that could be better, but its just sitting there burning a hole in his wallet instead as he chases trying to recreate Grok's companion thing.

reply

upvote

by cameldrv1 hours ago|

[-]

Yes, but you’re paying with your data unless you’re hosting with a provider you trust or self-hosting.

reply

upvote

by ImaCake18 hours ago|

[-]

Significantly cheaper than comparable models if you are using openrouter [0]. Just yesterday I spent roughly 13 cents centering some divs using Deepseek in a personal project. It would have been north of $1 to do that with a US frontier model.

0. https://openrouter.ai/compare/z-ai/glm-5.2/anthropic/claude-...

reply

upvote

by ipaddr12 hours ago|

[-]

For centering divs the free models opencode offers can easily handle that work. DeepSeek V4 Flash is pretty decent.

reply

upvote

by ImaCake8 hours ago|

[-]

Sure, but something that is “sonnet tier” is going to get there faster and with less pain. Well worth the 13 cents!

reply

upvote

by ipaddr5 hours ago|

[-]

Flash will get their faster then the sonnet tier which involves reasoning which is slow. And you don't need reasoning to center divs.

The sonnet tier sits below claude or chatgpt in terms of price but costs so much more than free models. If you are breaking downtasks now I'm not sure that 13 cents is worth it.

reply

upvote

by narrator7 hours ago|

[-]

The tokens cost the same everywhere on earth. This does hurt some cost advantages of outsourcing when tokens start to become a bigger part of development costs.

reply

upvote

by brian-armstrong12 hours ago|

[-]

I read these stories and I can never figure out how people are managing to use these $200 plans. If I really go full bore, I can sometimes max out the $20 plan. Even then, it already produces more code than I can reasonably review and merge.

reply

upvote

by ipaddr12 hours ago|

[-]

I've maxed out my chatgpt plus the first week and that include an smf forum rewrite. Trying my best I haven't been able to max out again. Things are setup that you need to max out your 5 hour window multiple times which becomes a job in itself.

At work I'm struggling to keep my claude bill around $500.

reply

upvote

by girvo8 hours ago|

[-]

Simple: a lot of the people claiming they’re reviewing the output of these models are lying.

Also if you run the “loops” they’re now yapping about, it will burn through enormous amounts of usage as well.

reply

upvote

by theoli2 hours ago|

[-]

Exactly this, it’s the loops. The first 50k tokens of a task is by far the most valuable. But when left to run independently, the agent will consume millions of tokens of error messages from running tests and discovering a minor syntax error, a missing import, a method call with incorrect parameters, etc. Then it will write some helper program while debugging the main task and get into the same loop debugging minor errors in the helper. From my experience, the vast majority of tokens consumed by Claude Code on totally independent tasks are consumed fixing minor mistakes it just made.

reply

upvote

by hgomersall8 hours ago|

[-]

I can't even keep up with the chain of thought needed to manage a single session, let alone review. I typically never exceed 30% of a 5x plan. Fable took me almost to the limits, but not Opus. Claude design hits things harder, but still not to saturation.

reply

upvote

by RugnirViking8 hours ago|

[-]

do you do it for a job (8 hours a day)? and do you work in large, mature projects (more than 5 team members)? A big part of it is dealing with frankly terrible architecture and 15 people's different ideas of how things should work (and the spam theyve been able to do with their own agents makes this worse)

reply

upvote

by 2 hours ago|

[-]

deleted

reply

upvote

by matheusmoreira18 hours ago|

[-]

> It's increasingly feeling, to me, that theres a gap building up between haves and have nots.

People speak of a permanent underclass.

https://www.nytimes.com/2026/04/30/opinion/ai-labor-work-for...

reply

upvote

by alpineman11 hours ago|

[-]

With open weight models there is true inference competition. Whoever can serve the model at the lowest price. And the consumer wins. Capitalism, served by China.

reply

upvote

by throwaway-blaze18 hours ago|

[-]

Just don't ask it to tell you the events of June 4, 1989.

reply

upvote

by swingboy8 hours ago|

[-]

My work involves asking LLMs about both Tianenmen Square and what’s going on in Gaza, so I can’t use Chinese or American models!

reply

upvote

by girvo8 hours ago|

[-]

Not that it matters but most of the open weight models aren’t actually censored that way: they run another layer on top of to do that. At least some of them do, Step 3.7 Flash locally happily tells me about the Tiananmen Square massacre

reply

upvote

by jampekka8 hours ago|

[-]

[flagged]

reply

upvote

by ttoinou2 days ago|

[-]

200 is much less than the value you’re supposed to get out of it. If it’s not then yeah go ahead and use cheaper models with worst quality

reply

upvote

by martinjc20 hours ago|

[-]

Are you aware of how much purchasing power 200 dollars is in china, brazil, thailand or india is? This is an extremely arrogant take.

reply

upvote

by dash212 hours ago|

[-]

Parent’s point was that many many people will get much more than $200 value from the “expensive” model. Sure, a Bihar farmer won’t, but even an Indian software developer may easily do if he or she has Western clients.

reply

upvote

by nwienert18 hours ago|

[-]

I’ve hired many asian developers anywhere from 1-4k a month.

I get a lot more out of a 200/mo subscription now in a week than I did from them in a month.

Now obviously in today’s world they’d be using a 200/mo subscription themselves. But it’s not like money is nothing, software development doesn’t scale down below 1k/mo for anyone competent even in the poorest areas.

reply

upvote

by xydone14 hours ago|

[-]

The point the post you replied to is making is that while you get value out of it, and in your case it's not that expensive, it's just simply not the case worldwide

reply

upvote

by nwienert13 hours ago|

[-]

I don’t think you’re really reading between the lines.

reply

upvote

by mrngld6 hours ago|

[-]

What's that got to do with the cost of a thing? Are tradesmen in Thailand entitled to Makita tools just because American plumbers can afford them? I'm struggling to understand the entitlement in some of the comments. And even though it doesn't matter I'd point out it's not like OpenAI or Anthropic are making enormous profits at the moment.

reply

upvote

by matheusmoreira18 hours ago|

[-]

For the record, 200 USD is around 60% of the brazilian minimum wage.

reply

upvote

by ttoinou3 hours ago|

[-]

How about brazilian median software developer wage ?

reply

upvote

by matheusmoreira10 minutes ago|

[-]

According to Glassdoor statistics, brazilian developers make between 600-1600 USD per month on average. Seniors might rise above 2000 USD.

So a 200 USD subscription falls between 10% and 33% of an average brazilian developer's salary.

reply

upvote

by Dayshine1 days ago|

[-]

I'm not sure how I'm supposed to get $200 of value out of personal use!

reply

upvote

by LPisGood21 hours ago|

[-]

Note that 200 dollars of value is different than 200 dollars of profit.

reply

upvote

by devmor20 hours ago|

[-]

I personally don’t find it that useful for most tasks, but if say, you get paid $50/hr for your work and it saves you more than 4 hours of work in a month, there you go.

reply

upvote

by selcuka17 hours ago|

[-]

Obviously this assumes that you can find 4+ extra hours of $50/hr work every month, or you can work 4 hours less. Neither of these assumptions is correct for people who work for a fixed salary.

reply

upvote

by windexh8er5 hours ago|

[-]

I think this is the rub the enterprise will be forced to grapple with. Not everyone is going to get $200 worth of value for the organization. In fact since it's not a restricted tool some will waste time and company resources using it. Undoubtedly some will get the value out of it, but it's very likely, that these are the same people providing more than what they're paid already. Nothing has changed other than, potentially, time savings and (hopefully) output improvement. Neither of those are any sort of guarantee though, either. Subjective systems are hard to show value, especially in the long term.

reply

upvote

by devmor13 hours ago|

[-]

That doesn’t change value. It’s value whether or not you can maintain a profit over it.

reply

upvote

by 19 hours ago|

[-]

deleted

reply

upvote

by holoduke20 hours ago|

[-]

Here most of my colleagues have +200 dollar rates. It's really a no brainer. But sure, in south America or some Asian countries maybe it is. But still most devs need it anyway. Also in the poor regions.

reply

upvote

by Kuxe11 hours ago|

[-]

In Sweden $200 is ~5% of average programmer monthly income after tax. $200/h rate is not a representative salary for SEs in South America, Asian countries nor Europe.

If you're running a business I agree it's a no-brainer, but the context here is for personal projects.

reply

upvote

by holoduke10 hours ago|

[-]

Come on. The 200 spend on Claude is easily earned back. A few hours of work maximum.

reply

upvote

by HDBaseT19 hours ago|

[-]

$200/h is on the extreme end and I would argue most people here aren't anywhere close to that.

The median hourly wage in the US is $28/h, this equates to nearly 7.5 hours. A full day of work a month for the average person to use Claude with reasonable limits.

Yes, the people on $28/h may not be the software development types, so their income might not be as high, but these are the people who would probably be vibe coding the most since they aren't day to day programmers!

reply

upvote

by ray_kay77718 hours ago|

[-]

I suspect the reply above is referring to charge out rates rather than wages.

reply

upvote

by HDBaseT15 hours ago|

[-]

My fault, thanks for the correction (:

reply

upvote

by folkrav18 hours ago|

[-]

Most of the world's developers, even in not-poor regions, make significantly less than what your colleagues charge.

reply

upvote

by uberex20 hours ago|

[-]

Unless that value is $200 cash in hand it will be hard to afford it for people who just don't have $200.

reply

upvote

by margalabargala19 hours ago|

[-]

Last time you bought a computer, did you buy the absolute fastest best CPU available?

reply

upvote

by girvo19 hours ago|

[-]

Yes, but that was because I could see the writing on the wall with respect to hardware prices being cooked by AI demand, so I built the best computer possible at the time knowing it'd probably need to last me the next 5+ years

So not really comparable. I use Step 3.7 Flash locally, models are good enough for so many coding tasks even at the lower end! (Though I note that calling a 200B model "lower end" is kind of amusing)

reply

upvote

by smrtinsert19 hours ago|

[-]

I've actually come to believe the overwhelming majority of use cases require nowhere frontier quality so there's that. Much faster execution is just a bonus on top of the much reduced cost

reply

upvote

by geye12341 hours ago|

[-]

Curious to hear if anyone has tried running the 2-bit or 3-bit quantization of this. With a bit of investment I may just be able to swing it locally. I already have 96GB VRAM, so with 192GB RAM, which seems to be the most one can find these days with a 4-slot motherboard, I may be in with a shot. Yes, it'd be slow, but I could give it overnight jobs. But I don't know if running at such a low quantization would make it hallucinate with only a small context.

Qwen and Gemma are great, but they need babysitting every 30 mins, which is quite a cognitive load.

reply

upvote

by christophilus18 hours ago|

[-]

I've been working with Deepseek V4 Flash (with opencode as the harness). It's been almost indistinguishable from Codex / Claude Code for me. I'm sure I'll run into problems when I get to a stickier ticket to tackle. But so far, it's been quite good, and I find it writes straightforward code.

I do think the Chinese models are good enough for an 80/20 rule use case.

reply

upvote

by mark_l_watson6 hours ago|

[-]

I also use DeepSeek v4 flash and v4 pro, but I can’t settle between using Claude Code or OpenCode and it seems like I waste time switching back and forth (especially keeping my personal SKILLs files synced). On one hand, a ton of engineering work has gone into Claude Code, on the other hand all Chinese models I have tried with OpenCode seem well configured out of the box.

I was thrilled to have Gemini Ultra for a month and use as many Opus tokens with AntiGravity as I could use, but I am happier using less capable models like DeepSeek knowing that it is more fun to do more of the work myself, it is a smaller hit on the environment, and incredibly cheaper.

reply

upvote

by scottchiefbaker15 hours ago|

[-]

I tried Deepseek V4 Flash with very low expectations and was pleasantly surprised. It's a surprisingly capable model for the price.

reply

upvote

by timcobb15 hours ago|

[-]

What provider(s) do you use?

reply

upvote

by saaspirant10 hours ago|

[-]

Not op but I use their official platform. Cheapest token top-up is $2.12

reply

upvote

by vagrantJin5 hours ago|

[-]

That v4 quality is available to everyone in the world for a pittance is beyond remarkable.

reply

upvote

by solarkraft7 hours ago|

[-]

I use Pro because I’m insensitive to the price difference, but also found Flash very capable in OpenCode.

reply

upvote

by nunodonato10 hours ago|

[-]

it would be a really great option if it didn't lack vision

reply

upvote

by pizzafeelsright1 hours ago|

[-]

this is mcp or custom call to lowest cost model

someone did a webcam + agentic + capture of other computer bios/boot -> upload to image model -> back to agent

reply

upvote

by RugnirViking8 hours ago|

[-]

what do you use vision for? I have failed to find a workflow with it that makes sense, asking it to review screenshots of websites or whatever it misses extremely obvious details like text flowing out of it's container/overlapping other text, things being in entirely the wrong place, etc.

reply

upvote

by bckr4 hours ago|

[-]

What models have you tried? Gemini 3.1 pro has vision capable of reading my sloppy diaries from 10 years ago, down to small glyphs and doodles.

reply

upvote

by RugnirViking4 hours ago|

[-]

I mean they mostly work for OCR, I meant in a coding context.

reply

upvote

by cromka8 hours ago|

[-]

For coding?

reply

upvote

by guybedo18 hours ago|

[-]

GLM-5.2 has been a step change in how fast i can burn through tokens.

I subscribed to their max plan to try it out. It counted me 700M tokens and drained my weekly quota in under 2 days.

Quota just reset less than 24h ago and i'm already >60% weekly quota usage.

For reference the kind of work i did would have used somewhere between 3% and 5% of Codex max or Claude max.

The model is good, the plan is a scam

reply

upvote

by try-working15 hours ago|

[-]

Kimi and GLM models have coined a new term: Thinkslop. They run a chain of thought that is up to 10x longer than other models and it seems that through a lookback mechanism they are able to use the CoT to reason about solutions to tasks they couldn't otherwise solve.

The downside is of course that they consume many more tokens off your plan, and also that they are significantly slower. Kimi K2.7 takes about 7x longer to finish the same benchmark tasks as DeepSeek V4 Pro on my router benchmarks (https://role-model.dev/).

So for now I'm happy with just two models: GPT and DeepSeek.

reply

upvote

by PhilippGille10 hours ago|

[-]

> Kimi and GLM models have coined a new term: Thinkslop. > [...] > So for now I'm happy with just two models: GPT and DeepSeek.

1. DeepSeek V3.2, V4 Flash, V4 Pro, at high or max thinking, ... when recommending a model it should always be a precise model, not just an AI lab

2. DeepSeek V4 Flash at max thinking is the most verbose model (among top models) in the AA benchmarks. See the "Intelligence Index Token Use" chart: [1]

[1]: https://artificialanalysis.ai/models?models=gpt-5-5-high%2Cg...

reply

upvote

by try-working6 hours ago|

[-]

I said specifically V4 Pro. Flash is not the most verbose, that's more likely to be Kimi.

reply

upvote

by guybedo15 hours ago|

[-]

yeah Kimi K2.7 was doing ok but was painfully slow. The coding plan limits were good though.

I haven't tried deepseek yet, i should check this one out.

reply

upvote

by try-working12 hours ago|

[-]

After the release of K2.7, the Kimi plan quotas have been reduced by about 80%.

reply

upvote

by spwa412 hours ago|

[-]

Turning up the thinking (max time spent thinking) lever really changes model performance, even for tiny models. But it's really irritating because it adds a lot of time.

reply

upvote

by jubilanti18 hours ago|

[-]

> The model is good, the plan is a scam

If it is needing to generate that many tokens to do the same tasks, then it probably has higher inference costs. So (for you) the model is bad, the plan is the same plan.

reply

upvote

by thefourthchime2 hours ago|

[-]

I gave it my standard:

"Make a pac-man game in a single html page"

It went off and argued with itself for 20 minutes about how to lay out the map and then timed out.

reply

upvote

by anatoliikmt16 hours ago|

[-]

What kind of tasks have you been using it for?

reply

upvote

by aunty_helen19 hours ago|

[-]

I signed up to a z.ai max account, $144. Hardly been able to use it as it 429s on most requests. They’re also refusing to refund me.

reply

upvote

by osti18 hours ago|

[-]

Even as a GLM z.ai fan, I wouldn't pay for their plans. They are just way worse values than gpt or anthropic plans, in terms of both usage and capabilities.

reply

upvote

by rescbr1 hours ago|

[-]

For me it works fine outside of afternoons Beijing time on weekdays.

reply

upvote

by 11 hours ago|

[-]

deleted

reply

upvote

by ticoombs6 hours ago|

[-]

Opencode Go subscription has served me well.

reply

upvote

by reissbaker15 hours ago|

[-]

Self-promo but you should try our service synthetic.new. We generally have up-to-date open-source LLMs on the sub, and we have GLM-5.2 :) Perf+stability should be wayyy better than zai.

reply

upvote

by dotancohen9 hours ago|

[-]

What do you do differently that you expect to have better performance than an experienced, established player?

reply

upvote

by mtlynch6 hours ago|

[-]

Not GP, but just being smaller makes it easier to achieve reliability. Like if you're a git forge with 100 similar customers, you can likely achieve an order of magnitude better reliability than GitHub, who is trying to serve millions of customers with wildly different needs.

reply

upvote

by dotancohen2 hours ago|

[-]

The problems for services such as GitHub with scaling are reducing cost per customer. That's even more pertinent when discussing inference at scale.

reply

upvote

by mtlynch1 hours ago|

[-]

> The problems for services such as GitHub with scaling are reducing cost per customer. That's even more pertinent when discussing inference at scale.

I don't think that's true. When I look at GitHub's incident history,[0] it doesn't read to me like a company that's struggling to cut costs. It looks like a company that's trying to do a million things to serve a million use cases, and the growing interconnections between all those distinct services and workflows cause unexpected failures.

[0] https://www.githubstatus.com/history

reply

upvote

by guybedo18 hours ago|

[-]

same here. Barely usable due to API connections issues.

And when i can use it, it just drains the quota 5 times faster than codex or claude.

Their plan is a scam

reply

upvote

by sergiotapia18 hours ago|

[-]

My experience as well unfortunately :(

reply

upvote

by fartcoin6716 hours ago|

[-]

[dead]

reply

upvote

by timcobb19 hours ago|

[-]

Can people share their GLM and open model setups in general please? What provider do you use. Why do you trust it with serving full quality? What harness do you use? Why do you trust it not to have malware (most harnessed are TS apps). I am just trying GLM 5.1 from Nvidia build in open code would love to hear how you all do it, thanks.

reply

upvote

by 59nadir9 hours ago|

[-]

> What provider do you use?

1. My own harness + Local (which usually means Qwen3.6-35B-A3B), I use this fairly often for research gathering on topics, info gathering on code bases, etc.

2. My own harness + DeepSeek v4 Flash served by DeepSeek, I added $20 quite some time ago and somehow still have $18.77 in there after I don't know how many prompts. I use this pretty often, slightly less than my local setup, it's great and what I'm planning on running locally (eventually).

3. My own harness + OpenRouter with whichever model I want to try out. I use this very rarely.

4. Pi + OpenAI Codex $20 subscription. I don't use this almost at all anymore, but I keep the Codex subscription for testing things out to see how GPT-5.5 will handle a problem the other setups have issues with.

> Why do you trust it with serving full quality?

The only thing I've noticed seems unbearably useless sometimes versus what I noticed before was GPT-5.5 which has had some of the weirdest degradations I've seen. It's not to Anthropic levels but it definitely had some service issues a few times where I was wondering if they had accidentally (or purposefully) lobotomized it.

Everything else has mostly just been the same, except DeepSeek I noticed had some speed issues a few days ago.

> What harness do you use? Why do you trust it not to have malware (most harnessed are TS apps)?

I pretty much only use my own, agents are trivial to make and it's definitely not hard to make one that's better than Claude Code or Codex for whatever you're doing.

reply

upvote

by timcobb57 minutes ago|

[-]

Do you write /maintain evals? This is something I want to get into more. Otherwise I feel really blind and feel compelled to just drop money on frontier.

reply

upvote

by mark_l_watson5 hours ago|

[-]

I want to say that I agree with you on the value of writing your own coding harness. I wrote something simple in Emacs Lisp and it makes me happy occasionally using it. I am trying to learn Rust and I am working on my own Rust core orchestration layer and I plan on both a Rust command line client and I already have a Python library wrapper for the Rust code that I have written so far. I write a lot of ‘little books’ and I am almost sure to write yet another one on my current hacking project.

Are my little hacks as effective as OpenCode or Claude Code? No way, but I am learning a lot and having fun.

reply

upvote

by michimagdesign19 hours ago|

[-]

Next to my Claude Pro plan, I have subbed to OpenCode Go. I find the OpenCode UX much better than in Claude Code CLI. As for models, I started a few months ago with GLM 5.1 and it was solid and could archive near sonnet-level tasks. It weirdly sputtered out Chinese characters sometimes. Then I switched to Kimi K2.6, which is the Chinese model I used the most until now. It used way too many reasoning tokens (improved in k2.7). But executed Claude created plans reliably. Now I’m back with GLM 5.2 and it’s really solid (among other things it’s good at design) and I get good usage with the $10 plan. Still the Claude models have less hiccups but the Chinese models are getting really close.

reply

upvote

by mark_l_watson5 hours ago|

[-]

OpenCode Go looked intriguing and I spent time reading their docs and pricing but didn’t purchase services. Do you think they are running it at a loss to get market share? (Probably not.) I have been happy buying tokens directly from DeepSeek (I am retired and everything I do is open source code and writing open content books (the manuscript files are available along with the source code) so I have no privacy issues). I also use FireWorks.ai to try different models. Both API services are excellent, but I may try OpenCode Go for a month or two to support the devs of OpenCode.

reply

upvote

by pramodbiligiri1 hours ago|

[-]

It's possible they are running at a loss at present. But in a recent podcast their founder said he believes inference is profitable, based on their experience serving models: https://newsletter.pragmaticengineer.com/p/opencode (search for "profitable")

reply

upvote

by rescbr1 hours ago|

[-]

Z.ai legacy Pro coding plan which will last me until the end of the year + maki.sh as the agent.

OpenCode works fine, i just find it very resource intensive for no good reason.

reply

upvote

by gandreani19 hours ago|

[-]

I use both the openai subscription and the opencode go subscription. I use the go subscription for my personal work and the openai subscription for my consulting work.

The differences between the models are minimal, but I usually stick with gpt-5.4-mini, gpt-5.4, mimo-pro-2.5, deepseek-v4-pro. These latter ones have way more usage than even using 5.4-mini so I tend to use them in personal projects for that reason.

My harness is https://github.com/can1357/oh-my-pi. I trust it...enough. It updates very frequently so as a safe guard I run it sandboxed with https://github.com/containers/bubblewrap so it can only access the project folder and some whitelisted config files

reply

upvote

by timcobb18 hours ago|

[-]

Thanks. I was looking at open code go yesterday and I couldn't figure out if the base pricing is including usage or if that's just base pricing and then you have to pay for usage too. How does it work? It is very cheap.

reply

upvote

by arcanemachiner17 hours ago|

[-]

OpenCode Go is a smoking deal IMO. You basically get 6x multiplier on the $10 price since you get $60 worth of usage for $10. And the first month is only $5 so it's even better.

It goes pretty quick, but it's still a great deal. Highly recommended.

reply

upvote

by johndough11 hours ago|

[-]

    > What provider do you use.

OpenRouter with pinned DeepSeek provider or OpenCode Go

    > Why do you trust it with serving full quality?

Quality seems good so far.

    > What harness do you use? Why do you trust it not to have malware (most harnessed are TS apps).

I wrote my own. A minimal harness without dependencies is only 65 lines of Python.

reply

upvote

by chess10kp2 hours ago|

[-]

Pi is great, set it up with a system prompt to give the model more direction and think less, and it crushes anything I give it

reply

upvote

by smoe19 hours ago|

[-]

For work, I mostly use Codex and some Claude. For personal use, I’ve started using Chinese models directly through their respective providers, mostly for automation tasks and experiments so far, either via the API directly or through the Pi harness.

I do not trust any of them. Everything runs inside virtual machines, not just the sandboxes provided by the harnesses. I also do not run Claude or Codex directly on the host machine. Not just because of supply chain fears, but also because of how incredibly user hostile the VC funded companies are when it comes to installing random stuff on your machine.

reply

upvote

by ukuina15 hours ago|

[-]

Synthetic.new and Claude Code using GLM-5.2. Great model, but the harness will error out if using subagents. The base plan only allows one concurrent request at a time. Also, GLM will burn through your weekly quota in a day if you're not precise with your scope.

reply

upvote

by Fr0styMatt8815 hours ago|

[-]

Local using Qwen3.6-27B; 2xRTX 5070Ti graphics cards; VS Code with Cline at the moment and Ollama back-end (will get to trying the others soon).

reply

upvote

by rainmaking19 hours ago|

[-]

GLM 5.2 coding plan- I'll post the agent as soon as I can! But opencode works and their own zcode is really good as well.

reply

upvote

by mlmonkey17 hours ago|

[-]

Here are the numbers from their bar chart:

    1. SWE-bench Pro
    Model Score (%)
    GLM-5.2 62.1
    GLM-5.1 58.4
    Claude Opus 4.8 69.2
    GPT-5.5 58.6
    Gemini 3.1 Pro 54.2

    2. Terminal-Bench 2.1
    Model Score (%)
    GLM-5.2 81.0
    GLM-5.1 63.5
    Claude Opus 4.8 85.0
    GPT-5.5 84.0
    Gemini 3.1 Pro 74.0
    
    3. NL2Repo
    Model Score (%)
    GLM-5.2 48.9
    GLM-5.1 42.7
    Claude Opus 4.8 69.7
    GPT-5.5 50.7
    Gemini 3.1 Pro 33.4
    
    4. DeepSWE
    Model Score (%)
    GLM-5.2 46.2
    GLM-5.1 18.0
    Claude Opus 4.8 58.0
    GPT-5.5 70.0
    Gemini 3.1 Pro 10.0
    
    5. ProgramBench
    Model Score (%)
    GLM-5.2 63.7
    GLM-5.1 50.9
    Claude Opus 4.8 71.9
    GPT-5.5 70.8
    Gemini 3.1 Pro 39.5
    
    6. MCP-Atlas
    Model Score (%)
    GLM-5.2 77.0
    GLM-5.1 71.8
    Claude Opus 4.8 77.8
    GPT-5.5 75.3
    Gemini 3.1 Pro 69.2
    
    7. Tool-Decathlon
    Model Score (%)
    GLM-5.2 48.2
    GLM-5.1 40.7
    Claude Opus 4.8 59.9
    GPT-5.5 55.6
    Gemini 3.1 Pro 48.8
    
    8. Humanity's Last Exam
    Model Base Score (%) Score w/ Tools (%)
    GLM-5.2 40.5 54.7
    GLM-5.1 31.0 52.3
    Claude Opus 4.8 49.8 57.9
    GPT-5.5 41.4 52.2
    Gemini 3.1 Pro 45.0 51.4

Seems to be handily beating Gemini 3.1 Pro. What _is_ Google DeepMind doing (other than bleeding talent to A\ ) ?

reply

upvote

by vineyardmike15 hours ago|

[-]

> What _is_ Google DeepMind doing

I feel like it has been pretty visible about what’s happening, between their press and products and financial statements. It’s just not what people are accustomed to expect.

First, Google has become a major compute provider for competitors, thanks to TPUs. They’ve talked about allocating TPUs to GCP instead of their first party products. I can only assume it’s because they’re collecting a higher margin, and it covers the cost of data center buildout - which they’ve been aggressively doing. I wouldn’t be surprised if they made the financial decisions to delay or slow training for Gemini 3.5 when they provided last minute compute to Anthropic this spring.

Second, Gemini has very directly not been focused on agentic coding, maybe 3.5 Flash being the change. They’ve built models they can deploy to watch YouTube videos, Nest cameras, scale to AI in search, understand fitness info in Fitbit, etc. They’re very clearly not focused around agentic/coding. They’ve put in a ton of efforts into multimodal data in and out, and they’re the only major lab working on video generation still. There was leak/rumor that their cofounder (brin) was getting involved in the model training to renew focus on agents so maybe this will change, and again 3.5 already feels different.

reply

upvote

by 14 hours ago|

[-]

deleted

reply

upvote

by linzhangrun12 hours ago|

[-]

Just waiting for the 3.5 Pro they said would come out this month. Gemini is pretty much useless for any serious work right now.

reply

upvote

by verdverm16 hours ago|

[-]

copying the graphs and tables to HN is noisy and harder to read

reply

upvote

by JSR_FDED15 hours ago|

[-]

Still more helpful than this comment

reply

upvote

by nullbio11 hours ago|

[-]

The idea of an open-weight Mythos model is not scary at all. This space is moving so quickly that it'll looked at in 1-2 years as childs play.

reply

upvote

by Zopieux10 hours ago|

[-]

I don't understand those takes.

Open-weights perhaps, but definitely not self-hostable – since those require $20k+ capex – which is the real "step change" to me, as it ends the stranglehold providers have over censorship.

The only silver lining would be increased competition in API providers of those open-weight models leading to truly affordable prices and a race to remove stupid "safety" checks.

reply

upvote

by sibellavia11 hours ago|

[-]

While I agree with the post in its entirety, I think it would have been worth mentioning DeepSeek V4 Flash as well, which, in my view, had already reached a sufficient, if not high-level of agentic coding before GLM 5.2 (see DwarfStar).

reply

upvote

by ramon15610 hours ago|

[-]

I know very little about the current state of replacability of Opus but I do sometimes imagine a reality where Opus has been rebuilt as an open model. What plan does Anthropic have when it does happen?

Will they still rent out their own model, will they support the open model and become a resource provider? Will they be able to repay the billions of dollars ?

This is probably the first question I would ask someone from Anthropic, if I ever meet one.

reply

upvote

by olmo236 hours ago|

[-]

> Will they still rent out their own model, will they support the open model and become a resource provider?

Anthropic rents GPUs from xAI to run Claude. If there's an open weights competitor to Opus, why wouldn't Elon host it directly?

reply

upvote

by alpineman6 hours ago|

[-]

Did you read the article? Opus 4.5 has essentially been rebuilt already

reply

upvote

by mrngld6 hours ago|

[-]

Based on DeepSWE, Opus 4.8 gets you more intelligent output at lower price (GLM's token inefficiency is really biting them). GPT5.5 even moreso. And I don't recall about Opus but GPT is much, much faster at getting you the answer (again, GLM's token inefficiency).

It's neat, I guess, that we can compare them against models released last year, but I care about my options today, and the pareto frontier is about as far away as it ever was.

Add on top of that the extra features OpenAI and Anthropic have in their apps and...

reply

upvote

by alpineman3 hours ago|

[-]

As per the article, they are now about 6 months behind US frontier models, that's down from 9 months. The gap is closing

reply

upvote

by fraywing18 hours ago|

[-]

It feels like the gap is closing from an intelligence perspective. Or at least doing some kind of log flattening.

Been playing with GLM 5.2 in different contexts. It's less good if you don't max out thinking, but as xhigh it's been able to solve most problems I was throwing at Opus in the about the same amount of time (via OpenRouter).

Wild time to be alive.

reply

upvote

by JSR_FDED14 hours ago|

[-]

Anecdote, not “research”:

Yesterday I compared Deepseek, Kimi 2.6, MiMo 2.5 and GLM 5.2 for the same task (replace a custom token-based auth scheme with a cookies-based scheme across a front- and back-end codebase).

I used Opencode with the zen subscription to try different models.

All did this perfectly, basically indistinguishable from each other. However, when I pointed out that the new cookies-based auth didn’t allow multiple independent logins across browser tabs (which the previous scheme did allow) I noticed this:

Deepseek, Kimi, MiMo started giving me multiple options but advocating strongly that I should either accept this deficiency, or don’t use the cookies version (keep the old auth scheme). They were so similar it was as if they were all the same model.

Only GLM 5.2 said “here’s how to use cookies and also have tab-level separation”. The difference vs the other models was very stark.

reply

upvote

by 6 hours ago|

[-]

deleted

reply

upvote

by 14 hours ago|

[-]

deleted

reply

upvote

by themgt20 hours ago|

[-]

I just tested GLM 5.2 out via Z.ai in pi for a little one-off project that was already scoped. It actually did a relatively decent job starting out, and figured important things out from context.

But the reasoning traces became increasingly hilarious, with it getting confused and going in loops, doubting itself. I began to feel almost sad, it was like listening to the internal monologue of someone with anxiety disorder.

It made pretty good progress but wound up going in a lot of goofy loops and doing things a bit "off" from standards I'd hoped it would infer, and finally started going a bit nuts, "This is very confusing.", "OH WAIT", seemingly hallucinating a whole side-quest that didn't make sense and looking at making internal system changes to try to achieve its (now very confused) goal when I pulled the plug.

Without seeing the reasoning traces from Claude/GPT it's hard to really know, but it definitely didn't feel like the same quality of reasoning, even if dogged persistence does wind up actually working eventually.

reply

upvote

by dools18 hours ago|

[-]

The reasoning traces always look terrible and they’re frustrating to watch. It’s the same with Kimi. What’s interesting is that the end result is then good. I think it’s just some sort of devils advocate trick to get better output.

reply

upvote

by rufo15 hours ago|

[-]

The reasoning tokens are really just there to extend the amount the LLM can "compute" the problem; put another way, the only way a given model can "think" more about a problem is to fill more of its context with predicted tokens, which has the effect of increasing the accuracy of each token. The reinforcement learning these models go through generally doesn't care what the chain of thought tokens look like (outside of preventing loops/gibberish/reward hacking), only how good the final answer is - so while it does look something like "reasoning" to us and has a rough correlation with the final answer, treating it as actually representative of what the final answer will be or an actual thought process is giving those tokens too much credit :)

reply

upvote

by fc417fc80214 hours ago|

[-]

For me what really drove this point home (that reasoning traces aren't "real" by any reasonable definition of the term) was noticing instances of things being out of order and exhibiting various inconsistencies with the final answer. My favorite was an example posted to HN that went something along the lines of the model first output the conclusion, then performed the supposed derivation after the fact, then stated it needed to verify the earlier conclusion to verify the derivation was correct so it hallucinated a tool call, then it remarked positively about the verification matching, and finally it output a slightly different answer. At no point was the answer actually correct although it was vaguely in the ballpark.

reply

upvote

by teravor14 hours ago|

[-]

as compared to what though? you can't see the actual think traces for opus or gpt.

reply

upvote

by dools13 hours ago|

[-]

Compared to what comes out at the end. Like if you sit there watching Kimi k2.6 "think", you're like "what? no you fucking idiot!" and you get this urge to "steer" it and so on, but very rarely is that steering actually necessary, it just winds up popping out the correct answer and all of those 'Wait! That's it! I found it! Actually ... Let me just' is just whatever internal processing it needed to use to get to the correct response. Mostly likely it's just being self-adversarial and exploring a bunch of dumb avenues to isolate the best outcome with the highest probability

reply

upvote

by try-working15 hours ago|

[-]

thinkslop recursion.

reply

upvote

by eunos6 hours ago|

[-]

I have a hilarious theory why GLM (and Kimi) have this thinkslop,

apparently Chinese language as token is more information dense than English, so having these wasteful thinkslop in Mandarin isnt that damaging. So the developer focus mostly in Mandarin and didnt think of handling these thinkslop while American AI labs do.

reply

upvote

by jauntywundrkind20 hours ago|

[-]

I think the self-doubt might actually be a very crucial part of it's capability. I often feel compelled to interrupt when I'm watching it think (which thank the stars it let's us do, unlike the big American models!!), but usually it makes the right pick!

Being willing and able to reconsider seems very good. Going around and around, pulling in more thinking, integrating it: maybe that's why it is as good as it's good.

I want to emphasize again how excellent it is that we can see the thinking. I think this makes GLM so much better an experience for me. It gives me such insight into what is being considered, helps me see where things go wrong. It grounds me, gives me the notion of where the results come from. It was so jarring to switch to GPT and Opus and find that they won't discuss with me, won't reveal their thinking: that feels fundamentally unsafe, for me, for society, to have such a severe black box. I don't think it should be allowed, honestly.

Many thanks to this recent submission, which is the first time I've seen anyone blog about this core difference: The text in Claude Code’s “Extended Thinking” output is not authentic. https://patrickmccanna.net/the-text-in-claude-codes-extended... https://news.ycombinator.com/item?id=48630535

reply

upvote

by wuhhh19 hours ago|

[-]

Your post made me laugh because I experienced the same as you but the other way around. I switched from Claude to a multi model harness a couple of days ago and the first model I tried was GLM5.2.

I gave it some simple code porting exercises and watched dumbfounded at the reasoning, which was more like the ravings of a lunatic - but lo and behold, after much confusion and a dizzying number of eureka moments the task was completed very successfully.

I tried Kimi on a similar task, much faster, a little more reassuring somehow in its ramblings, also surprisingly good results.

To be clear, I’m not surprised the results were good because they’re not GPT or Claude, but because the line of reasoning was so bonkers. Coming from Claude, I was just not used to seeing this, but I’ll bet it’s just as nuts with the frontier models and we’re just not allowed to see it (I’m about to read the links you shared).

Agree wholeheartedly that transparency is of grave importance.

reply

upvote

by nl18 hours ago|

[-]

If you look at the "thinking" traces as ways of expressions of uncertainty rather than literal thinking they make more sense.

Consider debugging - you start off in one place, think you have worked out what is happening, and then there is a "oh but what about xxx" thing that happens and you explore another branch. Then you "have it for sure" until you find another edge case.

The LLM is doing something analogous. It's writing circuits to try to emulate your program. Each time it gets one that seems right it is very sure that circuit is correct, but then it finds another thing.

At any point you can stop and go "write code now" and it will, and the code will seems fine provided it hasn't hit one of these edge cases.

Turning up thinking time is literally forcing more exploration.

The words that come out are amusingly dramatic, but... TBH when I debug I often are like "WTF" and throwing my hands up in the air at some gotcha I didn't expect.

reply

upvote

by rainmaking19 hours ago|

[-]

Yeah isn't that thinking weird?

Now I see the issue clearly! But wait... now I have the full picture! But wait... Found it!

I gave up a few times because of it at first until I realized I just had to let GLM get on with it and what came out was great!

But once it was outright endearing- challenging bug, it said: I have been very thorough. Then it escalated where to look and aced it. Built in confucian values

reply

upvote

by RugnirViking7 hours ago|

[-]

I'm like 90% sure the harnesses inject those tokens into the ai to make them check their work. Things like "but wait" and "but what if..." etc. Like literally inserting them artificially and then say "carry on from here" to the ai so it's working as though it itself output those words so the ai has an opportunity to turn around if its making a mistake. repeat a bunch of times and we get something useful.

I started noticing those in gh copilot right around when they turned off thinking traces end of last year

reply

upvote

by wuhhh18 hours ago|

[-]

If there’s one thing I’ve learned these past couple of days, it’s to resist the temptation to jab the escape button and start waving my arms! I wonder how much of this cyclical self doubt / self congratulating I go through in my own thoughts without even realising it. If you could verbalise or articulate all the half thoughts, snatches of ideas, feelings and ruminations the human mind goes through on some tasks it might be even more bizarre (or could just be me)

reply

upvote

by 19 hours ago|

[-]

deleted

reply

upvote

by sosrobahu17 hours ago|

[-]

[dead]

reply

upvote

by melodyogonna7 hours ago|

[-]

American AI labs really need to start releasing good open-weight models.

reply

upvote

by fabijanbajo7 hours ago|

[-]

Agreed. Even just distilled versions of their frontier models would be a huge win for the open ecosystem

reply

upvote

by neosat18 hours ago|

[-]

I've been using GLM 5.2 recently (company hosted, for non-coding tasks) and it's been strong and reliable. There are areas where GPT 5.5 and Opus 4.x still feel marginally better but only marginally. For most tasks if GLM 5.2 is the only model I have to use I'm productive and happy. This was not true before GLM 5.2. No doubt in my mind that the gap is closing quickly and for most tasks that are not very specialized open models will be usably on par on flagship closed models and have an edge factoring in cost.

For coding I still use 5.5 w/ Codex and prefer that to other models + harness combinations.

reply

upvote

by GL264 hours ago|

[-]

if someone has any tutorial on how to run GLM-5.2 from a Rasberry Pi 5 (AI hat), I want it !

reply

upvote

by efficax3 hours ago|

[-]

GLM-5.2 is a huge model. I don't think it would fit on the AI HAT+ 2 even if you quantized it to 2 bits

reply

upvote

by seany18 hours ago|

[-]

What's the current best for ablation? Specifically chemistry and red-team/netsec?

reply

upvote

by forsalebypwner15 hours ago|

[-]

ime DeepSeek v4 Pro is great for cybersec/netsec, I have not tried GLM though

reply

upvote

by 16 hours ago|

[-]

deleted

reply

upvote

by newaccountman217 hours ago|

[-]

5.1 and Qwen 3.6 are great too IMO

reply

upvote

by yogthos15 hours ago|

[-]

It's by far the most competent open model I've tried yet. It's a bit slower than Claude, but in terms of coding capability it seems to get comparable results at least for the work I'm doing.

reply

upvote

by NovaCode3710 hours ago|

[-]

Honestly, glm is staying quiet close to claude but it can save tons of tokens either than anthropic model

reply

upvote

by nubg4 hours ago|

[-]

A question I always have is, how to the AI labs safeguard the leak of their model? Training a cutting edge model basically cost a minimum of hundreds of millions of dollars. And its all contained within a file. Okay, that file might be 500GB large, but its still just one blob that is worth almost a billion dollars. And they need to train new models every few weeks, have lots of people with access to it to debug it, run inference etc. I wonder when we will see the first leaks? Imagine if e.g. Opus 4.8 got leaked. Wouldnt that bankrupt Anthropic?

reply

upvote

by dools18 hours ago|

[-]

Is z.ai

Is 2 better than x.ai

reply

upvote

by alfiedotwtf8 hours ago|

[-]

Once open Chinese models look like they’re about to overtake closed US models, watch the US government push imperialism hidden behind increasingly hyperbolic national security concerns.

At the end of the day, open weights should be seen as nothing more than information (just more just numbers afterall), and so organisations like the EFF should sue for any restricting of the 1st Amendment

reply

upvote

by citizenpaul20 hours ago|

[-]

Ive been using glm5 since its release and still prefer it to glm5.1 and so far to glm5.2

Perhaps it is just my harness and workflow, but the older model still seems to work better. Also the token cost is significantly lower. I rarely spend more than $20 a week with $50 cap. Not even half claudes ambiguous minimum $200 a month plan.

reply

upvote

by rainmaking18 hours ago|

[-]

Now that's a tremendous pointer, I'm going to have to try that.

Do you full on let GLM5 get stuff done on its own or is it more like a guided workflow? The former's what the point releases doubled down on and is also something that uses a lot of juice.

reply

upvote

by ddemian1 hours ago|

[-]

[flagged]

reply

upvote

by bugthesystem5 hours ago|

[-]

[flagged]

reply

upvote

by Balinares2 days ago|

[-]

I can't help wondering what kind of models we'll see coming out of China once it gets its own chip fabs up and running. Right now it sounds like the US's export ban is not slowing them down a whole lot.

reply

upvote

by khurs5 hours ago|

[-]

>Right now it sounds like the US's export ban is not slowing them down a whole lot.

Just costing them a lot more money as they pay multiples more buying on the underground grey market.

reply

upvote

by ceejayoz20 hours ago|

[-]

> Right now it sounds like the US's export ban is not slowing them down a whole lot.

It may wind up being a massive boost to them in the long run, even.

Necessity is the mother of invention.

reply

upvote

by pkroll20 hours ago|

[-]

If this pans out, you're not at all kidding: https://www.youtube.com/watch?v=8ekndZwyOzo

reply

upvote

by verdverm16 hours ago|

[-]

Trump allowed more advanced chips (H200s) to be sold after his visit, because some people in the admin still believe the US can "addict" China to the hardware. It seems China is only letting a token few in, the ban is more on their side now, as Xi really wants indiginous capability.

reply

upvote

by pianopatrick18 hours ago|

[-]

There does not seem to be a big penalty for going slow anyways. People seem to just switch on cost as soon as a model can do a task well enough. There do not seem to be strong network effects or vendor lock in.

Seems to me that going slow is the better long term tactic. China can just let the USA pay the high R&D costs to figure out what works, then just copy what works.

reply

upvote

by briga18 hours ago|

[-]

With subsidization from the Chinese government they will probably be equal to or better than the models here. I mean, have you looked at the author list of any given AI paper published within, say, the past 5 years? I wouldn't be surprised if half or more AI researches are from China.

reply

upvote

by buzzin__17 hours ago|

[-]

Can you compare the amount to the USA subsidization? Which one is bigger? Per Capita? Per unit of economic growth achieved?

reply

upvote

by usef-16 hours ago|

[-]

You mean from the private investors? It seems the labs on both sides of the ocean are quite negative in their profitability right now due to the competitiveness. Though Anthropic claims they will have a profitable quarter this year (despite the huge build-out), so their margins on API costs are likely quite decent.

reply

upvote

by s_kazmi8 hours ago|

[-]

[dead]

reply

upvote

by modgate15 hours ago|

[-]

[flagged]

reply

upvote

by ideaxiaoshi12 hours ago|

[-]

[dead]

reply