Doesn't justify 10x the cost in that case imo
Buying the most expensive circular saw doesn't get you the best woodworking, but it is the most expensive woodworking.
https://medium.com/@adambaitch/the-model-vs-the-harness-whic... | https://aakashgupta.medium.com/2025-was-agents-2026-is-agent... | https://x.com/Hxlfed14/status/2028116431876116660 | https://www.langchain.com/blog/the-anatomy-of-an-agent-harne...
(I don't think anecdotes are useful in these comparisons, but I'll throw mine in anyway: I use GPT-5.4, GPT-5.3-Codex, Gemini-3-Pro, Opus, and Sonnet at work every week. I then switch to GLM-5.1 and K2-Thinking. Other than how chatty they get and how they handle planning, I get the same results. Sometimes they're great; sometimes I spend an hour trying to coax them toward the solution I want. The more time I spend describing the problem and solution and feeding them data, the better the results, regardless of model. The biggest problem I run into lately is that every website in the world is blocking WebFetch, so I have to manually download docs, which sucks. And for 90% of my coding and system work, I see no difference between M2.5 and SOTA models, because there's only so much better you can get at writing a simple script or function or navigating a shell. This is why Anthropic themselves have always told people to use Sonnet to orchestrate complex work and Haiku for subagents. But of course they want you to pay for Opus, because they want your money.)
Also not everyone wants to use Claude Code, so if they're paying API pricing it's more likely thousands of dollars a month. If you can get the same results by spending a fraction of that, why wouldn't you?
I have an Anthropic API key for work, and if I use Sonnet/Opus all day for agent coding, it ends up costing about $25.
I'd need more CPU/RAM to run multiple agents in parallel before I could spend much more than that.
That was the breaking point: I cancelled my subscription.
As it happens, I had a low coding workload over the past two weeks, so I've been noodling around in pi, mostly with the Gemini Flash API. I like it; I even agree it's a much better harness than CC. However, the lock-in is real. Even without switching models, which each have their own quirks, I expect my work speed to drop drastically for at least a week or two, even if I were focused on it fully. But after the learning period, I think pi will be faster. The danger, of course, is that CC is fairly on rails, while with pi you could end up spending all your time tinkering with the harness.
You can't do any serious work on it without rationing your work and kneecapping your workflows, to the point where you design workflows around Anthropic usage-limit voodoo rather than what actually works.
Without this, I run into WEEKLY usage limits on the $200 plan by just day 3, working on a single codebase, one feature at a time.
On a related note, I really need to try some local models (probably starting with Qwen), since, at least in 2026, the Chinese models are way better at protecting democracy and free speech than the US models.
What would happen if they learned that half of the American small and medium-sized companies had started pouring all their business information into such a service?