I think theyre absolutely needed. I can't afford 200 USD a month for personal use of coding AI, and I don't think such prices are reasonable for most of the world economy anyway. Not to mention US firms might be giving their employees a lot more than that.
It's increasingly feeling, to me, that theres a gap building up between haves and have nots. But then, we get news of these open weight models that are reasonably priced in inference with reasonable capabilities. Yes, they take maybe 6-9 months to get there, tbh, that's not a bad trade off at all.
Which of course causes some unfairness on both ends. Nobody here can compete with me. I often use left over tokens on local client projects; which despite lower pay, still pays off because they now take hours not days or weeks to complete. And nobody in the local clients talent pool can compete with me; unless they charge about half the market rate.
Take away my 500$ monthly grant; and I’d be more or less screwed. Better open models will more or less start to reduce this advantage. It’s not like I positioned myself here on purpose. But it’s definitely a „right place, right time“ situation.
This depends a lot on how you work, and how much of the architectural thinking you do yourself.
People seem to lose sight of the fact that a flash model today is as powerful as a frontier model from a year ago. If you were happy with GPT 4.x, you should be ecstatic that equivalent power is now basically free...
Mind if I ask you for a few vibe coding tips? I failed to solve you gh puzzle in the profile though.
I've been fooling around with DeepSeek 4 agentically. It's probably not as good as Anthropic offerings, but even those seem to be roiled in politics and strife and DeepSeek 4 is very good IMHO. I'll later try out GLM.
I'm in Australia. The government has set up a "return and earn" scheme to keep aluminium cans, plastic bottles and paper drink cartons out of the waste stream. A laudable project. The money you make from return drink containers is pretty low, $AU 0.1 per container. I've participated to get the rubbish out of natural water streams and to make a nano amount of money on the side.
When I looked at the costs of an app I was getting DeepSeek to help me with, I realised that the several hours I'd spent learning and building had cost something like 8 recycled containers. In my head after doing some DeepSeek stuff, I calculate a "cans per app" metric for myself for fun. I may even setup a simple graph to view my costs that way.
I kind of hope the Anthropics of the world get enough price competition from sources like DeepSeek and GLM to drop their prices significantly. Time will tell.
I'm using the Chinese DeepSeek provider, so everything done there could potentially be taken and used by the CCP... But this is hobbyist learning.
There is probably a market for Deepseek/GLM served from non CCP available servers. I might even look into how hard that would be to setup here.
I also hope that inference focused hardware will come to the fore, reducing energy use and cost. Realistically this will take time though, on the order of years.
Here in Oz, we have community batteries that community members can charge and later draw from. Their electricity prices are competitive. I wonder if someone could setup something like a community battery to run data centres... That way reasonable environmental consideration could be given to inference power generation... This might not work in a market like the US or Europe, but small market size might be an advantage... Who knows.
Please do. There is definitely a market for Deepseek / GLM hosted from non-China servers, there's over 20 providers for GLM 5.2 on OpenRouter alone... and they're all either Singapore (home of Z.AI / GLM), China, or US. There is nothing yet listed on OpenRouter from Europe (Inceptron still only has GLM 5.1). And of course, there is absolutely nothing hosted in Australia.
We're in a particularly dire situation in Australia. We're about to be cut off from Claude Fable and premium American models. The European Mistral models are garbage, at least in comparison to US models. Our only hope is going to be Chinese models (GLM 5.2 is good), and we're not even hosting them in Australia.
By the way, if you haven't tried an Anthropic model, it's worth spending at least $20 one month to give Opus 4.8 a try. I only got one night of access to Fable before I was cut off, but one single evening of Fable provided plans that I've been working through for about a week afterwards with Opus 4.8... and that was only Fable, not even Mythos. That's the kind of intelligence lead Australia is about to be cut off from.
(And kudos on the Containers For Change, that's something I do as well - mostly as an exercise incentive to walk to the local recycling machine, because the money certainly doesn't compensate for the time spent on the recycling.)
So two European providers at least
(Speaking as a not-so-proud Australian.)
Jeremy Howard was recommending fireworks.ai as a host of you want to go direct. Or there's Cloudflare.
For subscription alternatives people here on HN seem to mention Open Code Go a lot too https://opencode.ai/go
As opposed to Anthropic or OpenAI where everything done could potentially be taken and used by the US government.
Also, replace "could potentially" with "will definitely" in both cases, there's no conspiracy here.
We're stuck between two bad positions, so just use the one that's best for you, and wait for a better solution to arrive.
Why don't you exclusively host and use the open-weight western models, even if right now they don't perform as well?
I'd like to know the psychology behind this, because your actions feel contradictory to me.
A NYC dev and a dev in india have the same ai costs, based the ratio tokens/salary it becomes less of comparative disadvantage to be in NYC.
Now combine that with the fact that AI makes the act of generating code less a % time of the job, and the ability to get/refine requirements more of the job and you have a decent shift.
Has a very race-to-the-bottom feel to it.
Though in the grand scheme of it, $200/mo probably isn’t the real price either. Also looking at it not just in a vacuum - paying for a product that can change what you get from under you doesn’t seem great anyway.
At least with a locally-hosted model you know what you’re getting.
The LLM in a box is something you can buy today, but it 1. doesn’t serve over usb by default 2. costs $100k for hardware (not counting electricity) at 100 tps 3. can’t buy this from AliExpress.
Better to put that $100k in t-bills and just buy tokens even at api prices.
It's been awesome for embeddings and document OCR!
3D printing a case for it is on my todo list.
I’m using Qwen3.6:27B at home and mostly Sonnet/Opus (depending on the complexity of the task) at work.
You have to break things down into smaller chunks for the local models. For the bigger cloud ones they can do a lot of the broader thinking.
OpenAI already charges enterprise users a premium purely for that title over on-demand, no-contract usage. Retail users get a good deal. People make a lot of hay about subsidies but this is a very sane approach if you want exposure to these three different types of customers.
If that was true, they would be collaborating with each other and opening up all the results from their work.
The Chinese are genociding Uyghurs as we speak, purely for being Muslim, in numbers that dwarf any harm the US has done.
The list of wars the US is or was actively involved in[0] is SO LONG that the Wikipedia page is split into multiple different pages.
The main relevant ones are 20th[1] and 21st century[2], for which you better get a good grip on your mouse to scroll down.
I urge you to use your favorite AI to give you a rough summary of direct and indirect casualties of just those wars directly caused, started, or provoked by the US, from these lists.
For example, the "war on terror" alone has, so far, seen around 4.5–4.6 million+ people killed, and at least 38 million people displaced.
[0]: https://en.wikipedia.org/wiki/Lists_of_wars_involving_the_Un...
[1]: https://en.wikipedia.org/wiki/List_of_wars_involving_the_Uni...
[2]: https://en.wikipedia.org/wiki/List_of_wars_involving_the_Uni...
https://amnesty.ca/wp-content/uploads/2024/12/Amnesty-Intern...
Nothing China did comes close to this.
its not, this would require voted resolution to declare genocide. It was some report on inquiry by individuals with unknown bias.
He's sitting on a frontier model letting it burn a hole in his wallet that could actually pay for itself.
"Meta has been using Google’s Gemini large language model for most of its moderation and customer support, but staff have recently been told to switch to Meta’s new foundational model, Muse Spark, the people said."
https://www.ft.com/content/39251a31-4a9d-4870-b86c-dc6353d67...
0. https://openrouter.ai/compare/z-ai/glm-5.2/anthropic/claude-...
The sonnet tier sits below claude or chatgpt in terms of price but costs so much more than free models. If you are breaking downtasks now I'm not sure that 13 cents is worth it.
At work I'm struggling to keep my claude bill around $500.
Also if you run the “loops” they’re now yapping about, it will burn through enormous amounts of usage as well.
People speak of a permanent underclass.
https://www.nytimes.com/2026/04/30/opinion/ai-labor-work-for...
I get a lot more out of a 200/mo subscription now in a week than I did from them in a month.
Now obviously in today’s world they’d be using a 200/mo subscription themselves. But it’s not like money is nothing, software development doesn’t scale down below 1k/mo for anyone competent even in the poorest areas.
So a 200 USD subscription falls between 10% and 33% of an average brazilian developer's salary.
If you're running a business I agree it's a no-brainer, but the context here is for personal projects.
The median hourly wage in the US is $28/h, this equates to nearly 7.5 hours. A full day of work a month for the average person to use Claude with reasonable limits.
Yes, the people on $28/h may not be the software development types, so their income might not be as high, but these are the people who would probably be vibe coding the most since they aren't day to day programmers!
So not really comparable. I use Step 3.7 Flash locally, models are good enough for so many coding tasks even at the lower end! (Though I note that calling a 200B model "lower end" is kind of amusing)
Qwen and Gemma are great, but they need babysitting every 30 mins, which is quite a cognitive load.
I do think the Chinese models are good enough for an 80/20 rule use case.
I was thrilled to have Gemini Ultra for a month and use as many Opus tokens with AntiGravity as I could use, but I am happier using less capable models like DeepSeek knowing that it is more fun to do more of the work myself, it is a smaller hit on the environment, and incredibly cheaper.
someone did a webcam + agentic + capture of other computer bios/boot -> upload to image model -> back to agent
I subscribed to their max plan to try it out. It counted me 700M tokens and drained my weekly quota in under 2 days.
Quota just reset less than 24h ago and i'm already >60% weekly quota usage.
For reference the kind of work i did would have used somewhere between 3% and 5% of Codex max or Claude max.
The model is good, the plan is a scam
The downside is of course that they consume many more tokens off your plan, and also that they are significantly slower. Kimi K2.7 takes about 7x longer to finish the same benchmark tasks as DeepSeek V4 Pro on my router benchmarks (https://role-model.dev/).
So for now I'm happy with just two models: GPT and DeepSeek.
1. DeepSeek V3.2, V4 Flash, V4 Pro, at high or max thinking, ... when recommending a model it should always be a precise model, not just an AI lab
2. DeepSeek V4 Flash at max thinking is the most verbose model (among top models) in the AA benchmarks. See the "Intelligence Index Token Use" chart: [1]
[1]: https://artificialanalysis.ai/models?models=gpt-5-5-high%2Cg...
I haven't tried deepseek yet, i should check this one out.
If it is needing to generate that many tokens to do the same tasks, then it probably has higher inference costs. So (for you) the model is bad, the plan is the same plan.
"Make a pac-man game in a single html page"
It went off and argued with itself for 20 minutes about how to lay out the map and then timed out.
I don't think that's true. When I look at GitHub's incident history,[0] it doesn't read to me like a company that's struggling to cut costs. It looks like a company that's trying to do a million things to serve a million use cases, and the growing interconnections between all those distinct services and workflows cause unexpected failures.
And when i can use it, it just drains the quota 5 times faster than codex or claude.
Their plan is a scam
1. My own harness + Local (which usually means Qwen3.6-35B-A3B), I use this fairly often for research gathering on topics, info gathering on code bases, etc.
2. My own harness + DeepSeek v4 Flash served by DeepSeek, I added $20 quite some time ago and somehow still have $18.77 in there after I don't know how many prompts. I use this pretty often, slightly less than my local setup, it's great and what I'm planning on running locally (eventually).
3. My own harness + OpenRouter with whichever model I want to try out. I use this very rarely.
4. Pi + OpenAI Codex $20 subscription. I don't use this almost at all anymore, but I keep the Codex subscription for testing things out to see how GPT-5.5 will handle a problem the other setups have issues with.
> Why do you trust it with serving full quality?
The only thing I've noticed seems unbearably useless sometimes versus what I noticed before was GPT-5.5 which has had some of the weirdest degradations I've seen. It's not to Anthropic levels but it definitely had some service issues a few times where I was wondering if they had accidentally (or purposefully) lobotomized it.
Everything else has mostly just been the same, except DeepSeek I noticed had some speed issues a few days ago.
> What harness do you use? Why do you trust it not to have malware (most harnessed are TS apps)?
I pretty much only use my own, agents are trivial to make and it's definitely not hard to make one that's better than Claude Code or Codex for whatever you're doing.
Are my little hacks as effective as OpenCode or Claude Code? No way, but I am learning a lot and having fun.
OpenCode works fine, i just find it very resource intensive for no good reason.
The differences between the models are minimal, but I usually stick with gpt-5.4-mini, gpt-5.4, mimo-pro-2.5, deepseek-v4-pro. These latter ones have way more usage than even using 5.4-mini so I tend to use them in personal projects for that reason.
My harness is https://github.com/can1357/oh-my-pi. I trust it...enough. It updates very frequently so as a safe guard I run it sandboxed with https://github.com/containers/bubblewrap so it can only access the project folder and some whitelisted config files
It goes pretty quick, but it's still a great deal. Highly recommended.
> What provider do you use.
OpenRouter with pinned DeepSeek provider or OpenCode Go > Why do you trust it with serving full quality?
Quality seems good so far. > What harness do you use? Why do you trust it not to have malware (most harnessed are TS apps).
I wrote my own. A minimal harness without dependencies is only 65 lines of Python.I do not trust any of them. Everything runs inside virtual machines, not just the sandboxes provided by the harnesses. I also do not run Claude or Codex directly on the host machine. Not just because of supply chain fears, but also because of how incredibly user hostile the VC funded companies are when it comes to installing random stuff on your machine.
1. SWE-bench Pro
Model Score (%)
GLM-5.2 62.1
GLM-5.1 58.4
Claude Opus 4.8 69.2
GPT-5.5 58.6
Gemini 3.1 Pro 54.2
2. Terminal-Bench 2.1
Model Score (%)
GLM-5.2 81.0
GLM-5.1 63.5
Claude Opus 4.8 85.0
GPT-5.5 84.0
Gemini 3.1 Pro 74.0
3. NL2Repo
Model Score (%)
GLM-5.2 48.9
GLM-5.1 42.7
Claude Opus 4.8 69.7
GPT-5.5 50.7
Gemini 3.1 Pro 33.4
4. DeepSWE
Model Score (%)
GLM-5.2 46.2
GLM-5.1 18.0
Claude Opus 4.8 58.0
GPT-5.5 70.0
Gemini 3.1 Pro 10.0
5. ProgramBench
Model Score (%)
GLM-5.2 63.7
GLM-5.1 50.9
Claude Opus 4.8 71.9
GPT-5.5 70.8
Gemini 3.1 Pro 39.5
6. MCP-Atlas
Model Score (%)
GLM-5.2 77.0
GLM-5.1 71.8
Claude Opus 4.8 77.8
GPT-5.5 75.3
Gemini 3.1 Pro 69.2
7. Tool-Decathlon
Model Score (%)
GLM-5.2 48.2
GLM-5.1 40.7
Claude Opus 4.8 59.9
GPT-5.5 55.6
Gemini 3.1 Pro 48.8
8. Humanity's Last Exam
Model Base Score (%) Score w/ Tools (%)
GLM-5.2 40.5 54.7
GLM-5.1 31.0 52.3
Claude Opus 4.8 49.8 57.9
GPT-5.5 41.4 52.2
Gemini 3.1 Pro 45.0 51.4
Seems to be handily beating Gemini 3.1 Pro. What _is_ Google DeepMind doing (other than bleeding talent to A\ ) ?I feel like it has been pretty visible about what’s happening, between their press and products and financial statements. It’s just not what people are accustomed to expect.
First, Google has become a major compute provider for competitors, thanks to TPUs. They’ve talked about allocating TPUs to GCP instead of their first party products. I can only assume it’s because they’re collecting a higher margin, and it covers the cost of data center buildout - which they’ve been aggressively doing. I wouldn’t be surprised if they made the financial decisions to delay or slow training for Gemini 3.5 when they provided last minute compute to Anthropic this spring.
Second, Gemini has very directly not been focused on agentic coding, maybe 3.5 Flash being the change. They’ve built models they can deploy to watch YouTube videos, Nest cameras, scale to AI in search, understand fitness info in Fitbit, etc. They’re very clearly not focused around agentic/coding. They’ve put in a ton of efforts into multimodal data in and out, and they’re the only major lab working on video generation still. There was leak/rumor that their cofounder (brin) was getting involved in the model training to renew focus on agents so maybe this will change, and again 3.5 already feels different.
Open-weights perhaps, but definitely not self-hostable – since those require $20k+ capex – which is the real "step change" to me, as it ends the stranglehold providers have over censorship.
The only silver lining would be increased competition in API providers of those open-weight models leading to truly affordable prices and a race to remove stupid "safety" checks.
Will they still rent out their own model, will they support the open model and become a resource provider? Will they be able to repay the billions of dollars ?
This is probably the first question I would ask someone from Anthropic, if I ever meet one.
Anthropic rents GPUs from xAI to run Claude. If there's an open weights competitor to Opus, why wouldn't Elon host it directly?
It's neat, I guess, that we can compare them against models released last year, but I care about my options today, and the pareto frontier is about as far away as it ever was.
Add on top of that the extra features OpenAI and Anthropic have in their apps and...
Been playing with GLM 5.2 in different contexts. It's less good if you don't max out thinking, but as xhigh it's been able to solve most problems I was throwing at Opus in the about the same amount of time (via OpenRouter).
Wild time to be alive.
Yesterday I compared Deepseek, Kimi 2.6, MiMo 2.5 and GLM 5.2 for the same task (replace a custom token-based auth scheme with a cookies-based scheme across a front- and back-end codebase).
I used Opencode with the zen subscription to try different models.
All did this perfectly, basically indistinguishable from each other. However, when I pointed out that the new cookies-based auth didn’t allow multiple independent logins across browser tabs (which the previous scheme did allow) I noticed this:
Deepseek, Kimi, MiMo started giving me multiple options but advocating strongly that I should either accept this deficiency, or don’t use the cookies version (keep the old auth scheme). They were so similar it was as if they were all the same model.
Only GLM 5.2 said “here’s how to use cookies and also have tab-level separation”. The difference vs the other models was very stark.
But the reasoning traces became increasingly hilarious, with it getting confused and going in loops, doubting itself. I began to feel almost sad, it was like listening to the internal monologue of someone with anxiety disorder.
It made pretty good progress but wound up going in a lot of goofy loops and doing things a bit "off" from standards I'd hoped it would infer, and finally started going a bit nuts, "This is very confusing.", "OH WAIT", seemingly hallucinating a whole side-quest that didn't make sense and looking at making internal system changes to try to achieve its (now very confused) goal when I pulled the plug.
Without seeing the reasoning traces from Claude/GPT it's hard to really know, but it definitely didn't feel like the same quality of reasoning, even if dogged persistence does wind up actually working eventually.
apparently Chinese language as token is more information dense than English, so having these wasteful thinkslop in Mandarin isnt that damaging. So the developer focus mostly in Mandarin and didnt think of handling these thinkslop while American AI labs do.
Being willing and able to reconsider seems very good. Going around and around, pulling in more thinking, integrating it: maybe that's why it is as good as it's good.
I want to emphasize again how excellent it is that we can see the thinking. I think this makes GLM so much better an experience for me. It gives me such insight into what is being considered, helps me see where things go wrong. It grounds me, gives me the notion of where the results come from. It was so jarring to switch to GPT and Opus and find that they won't discuss with me, won't reveal their thinking: that feels fundamentally unsafe, for me, for society, to have such a severe black box. I don't think it should be allowed, honestly.
Many thanks to this recent submission, which is the first time I've seen anyone blog about this core difference: The text in Claude Code’s “Extended Thinking” output is not authentic. https://patrickmccanna.net/the-text-in-claude-codes-extended... https://news.ycombinator.com/item?id=48630535
I gave it some simple code porting exercises and watched dumbfounded at the reasoning, which was more like the ravings of a lunatic - but lo and behold, after much confusion and a dizzying number of eureka moments the task was completed very successfully.
I tried Kimi on a similar task, much faster, a little more reassuring somehow in its ramblings, also surprisingly good results.
To be clear, I’m not surprised the results were good because they’re not GPT or Claude, but because the line of reasoning was so bonkers. Coming from Claude, I was just not used to seeing this, but I’ll bet it’s just as nuts with the frontier models and we’re just not allowed to see it (I’m about to read the links you shared).
Agree wholeheartedly that transparency is of grave importance.
Consider debugging - you start off in one place, think you have worked out what is happening, and then there is a "oh but what about xxx" thing that happens and you explore another branch. Then you "have it for sure" until you find another edge case.
The LLM is doing something analogous. It's writing circuits to try to emulate your program. Each time it gets one that seems right it is very sure that circuit is correct, but then it finds another thing.
At any point you can stop and go "write code now" and it will, and the code will seems fine provided it hasn't hit one of these edge cases.
Turning up thinking time is literally forcing more exploration.
The words that come out are amusingly dramatic, but... TBH when I debug I often are like "WTF" and throwing my hands up in the air at some gotcha I didn't expect.
Now I see the issue clearly! But wait... now I have the full picture! But wait... Found it!
I gave up a few times because of it at first until I realized I just had to let GLM get on with it and what came out was great!
But once it was outright endearing- challenging bug, it said: I have been very thorough. Then it escalated where to look and aced it. Built in confucian values
I started noticing those in gh copilot right around when they turned off thinking traces end of last year
For coding I still use 5.5 w/ Codex and prefer that to other models + harness combinations.
Is 2 better than x.ai
At the end of the day, open weights should be seen as nothing more than information (just more just numbers afterall), and so organisations like the EFF should sue for any restricting of the 1st Amendment
Perhaps it is just my harness and workflow, but the older model still seems to work better. Also the token cost is significantly lower. I rarely spend more than $20 a week with $50 cap. Not even half claudes ambiguous minimum $200 a month plan.
Do you full on let GLM5 get stuff done on its own or is it more like a guided workflow? The former's what the point releases doubled down on and is also something that uses a lot of juice.
Just costing them a lot more money as they pay multiples more buying on the underground grey market.
It may wind up being a massive boost to them in the long run, even.
Necessity is the mother of invention.
Seems to me that going slow is the better long term tactic. China can just let the USA pay the high R&D costs to figure out what works, then just copy what works.