The best tests are your own custom, task-relevant standardized tests: ones the best models can't saturate, so aim for a pass rate under 70% even for the strongest model.
All this is to say that most people are not doing the latter and their vibes are heavily confounded to the point of being mostly meaningless.
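As a sketch of what such a personal eval could look like (the `stub_model` and the checkers here are placeholders for illustration, not a real API client):

```python
# Minimal personal-eval harness: each case is a prompt plus a checker
# that decides whether the model's output passes.
def pass_rate(model, cases):
    passed = sum(1 for prompt, check in cases if check(model(prompt)))
    return passed / len(cases)

# Placeholder "model": a real run would call your provider's API here.
def stub_model(prompt):
    return "42" if "answer" in prompt else "unsure"

cases = [
    ("What is the answer?", lambda out: "42" in out),
    ("Refactor this function...", lambda out: "def " in out),  # hypothetical check
]

print(f"pass rate: {pass_rate(stub_model, cases):.0%}")  # prints "pass rate: 50%"
```

If a new model saturates your suite, add harder cases until the pass rate drops back below your target.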
You are right that this is not exactly subjectivity, but I think for most people it feels like it. We don't have good benchmarks (imo), we read a lot about other people's experiences, and we have our own. I think certain models are objectively better at certain tasks; it's just that our ability to know which is which is currently impaired.
People judge models on their outputs, but how you like to prompt has a tremendous impact on those outputs and explains why people have wildly different experiences with the same model.
One model can replace another at any given moment in time.
It's NOT a winner-takes-all industry
and hence none of the lofty valuations make sense.
the AI bubble burst will be epic and make us all poorer. Yay
But I agree it's close enough that it's worth using heavily. I've not cancelled my Claude Max subscription, but I've added a z.ai subscription...
Will try it out. Thanks for sharing!
If this was the case then Anthropic would be in a very bad spot.
It's not, which is why people got so mad about being forced to use it instead of better third-party harnesses.
Pi is better than CC as a harness in almost every respect.
- It still lacks support for industry standards such as AGENTS.md
- Extremely limited customization
- Lots of bugs, including one that often makes it impossible to view pre-compaction messages inside Claude Code
- Obvious one: can't easily switch between Claude and non-Claude models
- Heavy resource usage
More than anything, I haven't found a single thing that Pi does worse. Everything is either straight-up better or the same.