Some people seem to agree and some don't, but I think that indicates we're just down to your specific domain and usage patterns rather than the SOTA models being objectively better like they clearly used to be.
The best tests are your own custom personal-task-relevant standardized tests (which the best models can't saturate, so aiming for less than 70% pass rate in the best case).
All this is to say that most people are not doing the latter and their vibes are heavily confounded to the point of being mostly meaningless.
You are right that this is not exactly subjectivity, but I think for most people it feels like it. We don't have good benchmarks (imo), we read a lot about other people's experiences, and we have our own. I think certain models are going to be objectively better at certain tasks, it's just our ability to know which currently is impaired.
People judge models on their outputs, but how you like to prompt has a tremendous impact on those outputs and explains why people have wildly different experiences with the same model.
One model can replace another at any given moment in time.
It's NOT a winner-takes-all industry
and hence none of the lofty valuations make sense.
the AI bubble burst will be epic and make us all poorer. Yay
But I agree it's close enough that it's worth using heavily. I've not cancelled my Claude Max subscription, but I've added a z.ai subscription...
Will try it out. Thanks for sharing!
If this was the case then Anthropic would be in a very bad spot.
It's not, which is why people got so mad about being forced to use it rather than better third party harnesses.
Pi is better than CC as a harness in almost every respect.
- It still lacks support for industry standards such as AGENTS.md
- Extremely limited customization
- Lots of bugs including often making it impossible to view pre-compaction messages inside Claude Code.
- Obvious one: can't easily switch between Claude and non-Claude models
- Resource usage
More than anything, I haven't found a single thing that Pi does worse. All of it is just straight up better or the same.
Really liking pi and glm 5.1!
If Qwen3.6-Max is up there as well, it will be very interesting.
What agent harness did you use? Usually, "write_file", "shell_exec" or similar is two of the first tools you add to an agent harness, after read_file/list_files. If it doesn't have those tools, unsure if you could even call it a agent harness in the first place.
Every time I try to build something with it, the output is worse than other models I use (Gemini, Claude), it takes longer to reach an answer and plenty of times it gets stuck in a loop.
The big kicker for GLM for me is I can use it in Pi, or whatever harness I like. Even if it was _slightly_ below Opus, and even though it's slower, I prefer it. Maybe Mythos will change everything, but who knows.
Yes, but... isn't the same true for Opus and all the other models too?
So you're either paying $1000's for Opus in Pi, or $30/month for GLM in Pi. If the results are mostly equivalent that's an easy choice for most of us.
It also compresses the context at around 100k tokens.
In case anyone is interested: https://github.com/sebastian/pi-extensions/tree/main/.pi/ext...
You have to keep it below ~100 000 token, else it gets funny in the head.
I only use it for hobby projects though. Paid 3 EUR per month, that is not longer available though :( Not sure what I will choose end of month. Maybe OpenCode Go.
Evening CET experience for me is super smooth.
That would leave almost no tokens for actual work
And yes, sonnet/opus is better and what I use daily. But I wouldn’t be that upset if I had to drop down to GLM.
4.7 is better, but its also wildly expensive
So I am curious, how do people get these lazy outputs?
Is it by having one of those custom system prompts that basically tells the model to be disrespectful?
Or is it free tier?
Cheap plans?
In both cases the fix is really simple, just compact.
I was very impressed.
Every time codex reviews claude written rust, I can't explain it, but it almost feels like codex wants to scream at whoever wrote it.
FAANGS love to give away money to get people addicted to their platforms, and even they, the richest companies in the world, are throttling or reducing Opus usage for paying members, because even the money we pay them doesn't cover it.
Meanwhile, these are usable on local deployments! (and that's with the limited allowance our AI overlords afford us when it comes to choices for graphics cards too!)
You’re absolutely right!
Jokes apart, I did notice GLM doing these back and forth loops.
Also, thanks for pointing me at that specific paper; I spend a lot more of my life closer to classical control theory than ML theory so it's always neat to see the intersection of them. My unsubstantiated hypothesis is that controls & ML are going to start getting looked at more holistically, and not in the way I normal see it (which is "why worry about classical control theory, just solve the problem with RL"). Control theory is largely about steering dynamic systems along stable trajectories through state space... which is largely what iterative "fill in the next word" LLM models are doing. The intersection, I hope, will be interesting and add significant efficiency.
Could you please share more about this
One Is for local opencode coding and config of stuff the other is for agent-browser use and for both it did better (opus 4.6) for the thing I was testing atm. The problem with opus at the moment I tired it was overthinking and moving itself sometimes I the wrong direction (not that qwen does overthink sometimes). However sometimes less is more - maybe turning thinking down on opus would have helped me. Some people said that it is better to turn it of entirely when you start to impmenent code as it already knows what it needs to do it doesn't need more distraction.
Another example is my ghostty config I learned from queen that is has theme support - opus would always just make the theme in the main file
As so many things these days: It's a cult.
I've used Claude for many months now. Since February I see a stark decline in the work I do with it.
I've also tried to use it for GPU programming where it absolutely sucks at, with Sonnet, Opus 4.5 and 4.6
But if you share that sentiment, it's always a "You're just holding it wrong" or "The next model will surely solve this"
For me it's just a tool, so I shrug.
I find myself repeating the following pattern: I use an AI model to assist me with work, and after some time, I notice the quality doesn't justify the time investment. I decide to try a similar task with another provider. I try a few more tests, then decide to switch over for full time work, and it feels like it's awesome and doing a good job. A few months later, it feels like the model got worse.
1. The models are purposefully nerfed, before the release of the next model, similar to how Apple allegedly nerfed their older phones when the next model was out.
2. You are relying more and more on the models and are using your talent less and less. What you are observing is the ratio of your vs. the model’s work leaning more and more to the model’s. When a new model is released, it produces better quality code then before, so the work improves with it, but your talent keeps deteriorating at a constant rate.
As you note, the developer's input is still driving the model quite a bit so if the developer is contributing less and less as they trust more, the results would get worse.
One other failure mode that I've seen in my own work while I've been learning: the things that you put into AGENTS.md/CLAUDE.md/local "memories" can improve performance or degrade performance, depending on the instructions. And unless you're actively quantitatively reviewing and considering when performance is improving or degrading, you probably won't pick up that two sentences that you added to CLAUDE.md two weeks ago are why things seem to have suddenly gotten worse.
> similar to how you can expect more over time from a junior you are delegating to and training
That's the really interesting bit. Both Claude and Codex have learned some of my preferences by me explicitly saying things like "Do not use emojis to indicate task completion in our plan files, stick to ASCII text only". But when you accidentally "teach" them something that has a negative impact on performance, they're not very likely to push back, unlike a junior engineer who will either ignore your dumb instruction or hopefully bring it up.
> As you note, the developer's input is still driving the model quite a bit so if the developer is contributing less and less as they trust more, the results would get worse.
That is definitely a thing too. There have been a few times that I have "let my guard down" so to speak and haven't deeply considered the implications of every commit. Usually this hasn't been a big deal, but there have been a few really ugly architectural decisions that have made it through the gate and had to get cleaned up later. It's largely complacency, like you point out, as well as burnout trying to keep up with reviewing and really contemplating/grokking the large volume of code output that's possible with these tools.
Personally I can see the case for both interpretation to be true at the same time, and maybe that is precisely why I confused them so eagerly in my initial post.
I’d prefer providers to simply deprecate stuff faster, but then that would break other people’s existing workflows.
Newer (past two years?) models have improved "in detail" - or as pragmatic tools - but they still don't deserve the anthropomorphism we subject them to because they appear to communicate like us (and therefore appear to think and reason, like us).
But the "holes" are painted over in contemporary models - via training, system prompts and various clever (useful!) techniques.
But I think this leads us to have great difficulty spotting the weak spots in a new, or slightly different model - but as we get to know each particular tool - each model - we get better at spotting the holes on that model.
Maybe it's poorly chosen variable names. A tendency to write plausible looking, plausibly named, e2e tests that turns out to not quite test what they appear to test at first glance. Maybe there's missing locking of resources, use of transactions, in sequencial code that appear sound - but end up storing invalid data when one or several steps fail...
In happy cases current LLMs function like well-intentioned junior coders enthusiasticly delivering features and fixing bugs.
But in the other cases, they are like patholically lying sociopaths telling you anything you want to hear, just so you keep paying them money.
When you catch them lying, it feels a bit like a betrayal. But the parrot is just tapping the bell, so you'll keep feeding it peanuts.
In the same way, it’s hard to see how people who say they’re struggling are actually using it.
There’s truth somewhere in between “it’s the answer to everything” and “skill issue”. We know it’s overhyped. We know that it’s still useful to some extent, in many domains.
We're also seeing that the people up top are using this to cull the herd.
At some point the is a need to have faith in some stable enough ground to be able to walk onto.
Under normal circumstances I'd consider this a nit and decline to pick it, but the number of evangelists out there arguing the equivalent of "cure your alcohol addiction with crystal meth!" is too damn high.
I'd encourage you to check it out for yourself. It's certainly possible to be a dogmatic Buddhist, but one of the foundational beliefs of Buddhism is that the type of dogmatic attachment you're describing is avoidable. It's not easy, but that's why you meditate.
East, West, Religion, Practice… From a Zen perspective, you're just troubling your mind with binaries and conflict.
The binaries still functionally exist. I see a lot of value in reflective practices. At the same time it seems unlikely to me that the point of existing is to not trouble your mind.
If Buddhism can be said to have a goal, it is to reduce suffering (including your own), so troubling your own mind is indeed something it can help with. The point of existence would be something interesting to meditate on. If you discover it, let us all know!
Dogma, like the binaries, still functionally exists, whatever the narrative. If you can’t admit that, that might also be something interesting to meditate on.
Say you have eliminated all suffering. How many versions of that world exist? How many of them are true, beautiful, and good? See how, in order to evaluate the success or failure of Buddhism, we have to move beyond “eliminate suffering” to a higher value standard?
I think in every domain, the better you are the less useful you find AI.