I'm trying to wrap my head around exactly why so may people seem to want the best model available when it has recently become clear that most halfway decent models can write damn good code for a fraction of the price. And the frontier models get nerfed constantly so you with open weight you can get something slightly less performant but way more stable. Almost like buying a Ferrari for your daily commute instead of a Toyota or even a Mercedes.
I think there are several factors. Certainly marketing making us think we need the shiny thing which is rampant online and very smart people think they aren't susceptible to. There's a lot of really odd 'I trust Anthropic/OpenAI more than Deepseek' which tends to ignore, for starters, that you can run choose your provider and still save a ton. I also think there's some amount of addiction and brand loyalty where a Ferrari is one hell of a drive so that you turn your nose up at that sensible Toyota. Oh the other one I see used is like oh only fable can oneshot updating my embedded systems thing from 1975 to rust which is great but let's recognize how niche that is.
And it ends up just coming across as people are getting SO reliant on the tools so fast. Maybe it's ok to think and like read a few lines of code and work with these agents to convert your thing to rust or center your div. Even if coding is over which in some sense it certainly is, don't turn your mind into the wall-e people yet. I found myself guilty of this so often. It takes way more time and effort to do things via prompt and I wouldn't just open the editor and fix it because that dopamine hit of the magic the abstraction provided was so strong.
So I'm pretty much done using the 'best' (on benchmarks, if money isn't an object, etc etc) models available. After a year on Sonnet/Opus/GPT5x I'm having way better results with open weights models that don't get lobotomized weekly. I'm finding ways to do the crafting part of building software by focusing on honing my harness and workflow. I'm enjoying changing the oil on my Toyota after a year of almost flying off cliffs in my Ferrari and if I can check my ego it's a purely positive thing.
This is the logical end point of the fear-based way LLMs are marketed. You must want the best, because everyone who has the best can work faster than you, generate more — if you don't have the best, you are behind! Why would you want to use anything other than the best?
The thing is, once everyone has the best, the question is: how much can you spend? If you can't spend more, you are behind! If spending the most will get you ahead, why would you not want to spend the most, if you can afford it?
There is only one way through this, in the long run: work out a way forward that doesn't make you dependent on this cycle. If you can compete at all, without the spend, what happens is: they burn money and you don't.
FWIW so far I don't think the benchmarks prove very much about the actual experience, and you can discover this just as easily without spending any money. And we know this about benchmarks! Once a benchmark seems useful as a measurement, it becomes a target and it stops being as useful.
I think your strategy is right. It requires bravery, and as you say, it requires ego balance. But I believe it is obvious that the world will either come around to a more sensible, stable pattern or it doesn't matter either way because we're fucked. So opting out of this mad early cycle and choosing to be calmer and happier is a choice you can just make.
The difference is how the model is used.
With Opus you can give it a long-horizon task (eg build an entire feature) and it will plan it out and implement it and almost always stay on task. This is what people mean when they say "agentic tasks"
With the lessor models the code is fine, but they need something else to plan what needs to be done.
GLM-5.2 is the third model (after Opus 4.6+ and GPT-5.5) that can do this agentic style work.
Notably Gemini 3.1 Pro is notoriously bad at this style work - the code is good, but it drifts off task most of the time. 3.5 Flash is supposed to address this, but I haven't had a good reason to try it.
You are free to do you. But you were asking about why others want the best model.
The answer is, clearly, agentic coding (ie multiple agents each cranking through tasks independently) lets you ship A LOT more business value if used correctly.
And hey, don't get me wrong, you can get pretty far with just prompting. But the subtle misses and (I'm looking at you GPT) the overengineered 20k line PRs to do a simple thing are going to cost you a lot if you're not vigilant.
I don't think anyone is stopping you. This is an entirely valid way of working.
I for one am glad to leave that behind me. The sooner I never have to write another line of code the better (professional software engineer for nearly 30 years here, for context).
I am still struggling how to deal with sub agents and different roles for each model. I still think Claude or Codex are overall better models, but everything around them transpires such weird vibes, including, and this one kills me, that at certain times they feel like dumbed down.
I keep changing these things often, but I have basic subscription to codex (20$ plan) which I use with GLM 5.2 to do some high level planning of what I intend to do, and then leave Deepseek do the coding. Or something along those lines.
Point is, GLM 5.2 is now at a point where I cannot tell you if it's better or worse. I can tell you however one thing: no matter when I use it, it's consistent in what it does and how it works.
Then there is the Fable thing, but as with many things, I think the past has distorted the reality. It lasted two days, but Anthropic said it clearly for plan users it would only be there for two weeks. It was great for doing what you can already do with other tools: doing all the planning, and reviews, and launching a million subagents talking to each other. I sometimes wonder if it was really a new model, or just Opus 4.9 wrapped with some fancy model driven harness.
As for Fable: I used it as much as I could while we had it.
It was a step change over Opus with my work.
I've had no trouble getting the current generation of smaller models to do the same thing. Maybe it's more of a harness issue than a model issue?
Recently I've used both MiniMax M3 and DeepSeek V4 Flash to one-shot moderately complex applications from a written spec, and neither one got lost along the way
Price and speed, for me. GLM5.2 is "good enough" for some tasks, but rather slow (on their coding plan). In the time it takes GLM to "read files to figure out...", gemini flash is usually finished. It's not SotA for coding, but it's fast and often "good enough" for normal tasks.
For Flash 3.5?
I'm a big fan of Gemini 3.1 Flash Lite Preview (yes that is the name..).
I keep a agentic SQL benchmark up to-date to test new models. It's more-or-less saturated above 23/25 but below that is still useful, and even at that level is good for comparing speed, cost and toke efficiency.
3.1 Flash Lite Preview scores 22/25 in 142 seconds for $0.02. That's a great result if you care about cost for performance.
3.5 Flash scores 20/25 in 367 seconds for $0.76. The slow speed is because it takes a lot of tokens to generate its results, so even if tokens are produced quickly it takes too many to get a positive result.
There's nothing I've seen or heard that indicates 3.5 Flash is better than this indicates.
https://sql-benchmark.nicklothian.com/?highlight=google_gemi.... vs https://sql-benchmark.nicklothian.com/?highlight=google_gemi... (click the cells to see the traces)
I used to be obsessed with what's the best model. Then a while back when the new best model came out, I tested it on a task. I also tested its little brother (much smaller model from same company).
They both completed the task perfectly except the "best" model (the bigger one) cost 5x more and took 3x longer...
With the wealth of models available (open source vs closed, api vs local), I find optimizing the cost-efficiency of your token consumption an important part of business-oriented AI engineering. You don't need "the best" for every task.
Same for me, I certainly don't have the same definition of success and failure either.
A more expensive model has *less* rooms for wandering around than a cheaper model.
If Claude wanders around during 10min until finding the most obvious solution, then I count it as a failure.
To me this is a "more expectations mean more disappointment" situation.
Some people have higher expectations than others, and even the best model available is not good enough for what those people really want it to do once you start digging. In that light, the goal is not using the best model, but rather using the least insidiously deficient model.
Many people chase the edge because it's the least disappointing.
> when it has recently become clear that most halfway decent models can write damn good code for a fraction of the price.
The fatuousness of this statement pretty quickly becomes apparent if you spend more time looking at it, IMO, because the venn diagram of "damn good" and "not nearly good enough" strongly overlaps. Even the best model writing excellent lines of code still has noticeably deficient ability to decide which excellent lines of code to write. The goal is to improve the separation between them, not save a few dollars, because the emotional effort is worth more to us than the money.
> And the frontier models get nerfed constantly so you with open weight you can get something slightly less performant but way more stable.
Your minimization of performance differences and maximization of stability differences is exposing your biases.
Side note: I think you should know that to me at least some of what you said reads like self-rationalized moralizing. I couldn't help but imagine Principal Skinner saying "Am I so out of touch? No, it's the children who are wrong." People don't only want different things than you do because they don't know what they're doing.
Meanwhile, it's such a cheap model that I've spent not even $25 over 3 weeks.
This also allows me to play with, and mix codex, claude cli, and others. This is my happy spot for the last two months.
In my case, I rarely ever go over the Claude/ChatGPT subscription limits, so might as well use those considered-best models. If I had to generate millions of lines of code, maybe I would've used the open models more.
One group is trying to get the LLM to basically one shot everything and not properly reviewing the output.
Others are using the LLM to assist their human intelligence in a tight loop.
If you’re doing the former you really do need the best model available because that’s still right on the edge of what LLMs can do at best, and at worst you’re just shipping pure unmaintainable slop.
If you’re doing the latter then you can get away with a slightly less powerful model without it making a material difference because your human intelligence is filling in gaps
I generally started by reviewing but after a while (maximum in hours), I just can't keep up and resort to LLMs as sole reviewers.
While the single functions/classes/structs/... can be well though out the code tends to lack cohesion, and especially maintainability. For instance, it never thinks: "I could put this logic in an interface/trait so that if the requirements change I can simply add a concrete implementation that satisfies the new requirements (and potentially use one of these for testing)".
SoTA models can do reasonably good jobs on each ticket, but over time the architecture of the application starts degrading without a human in the loop.
The entropy increases slower with better models but the trend is always towards slop
It's not fast-changing, it's not abstract, it's just not that difficult, and where it is difficult, the AI cannot help you, because it is not capable of things you are capable of.
Learn CAD yourself. Honestly; I was sure I would never manage to learn CAD but it turns out to be interesting, rewarding, valuable and actually quite quick to learn.
An LLM certainly is not going to be able to do it better than you once you have a tiny bit of experience. (PCB design, perhaps, has a language to it that an LLM can make a bit more headway into, but as a non-PCB-designer I would still bet that it's more like CAD than code)
It has been hard to explain that they are in fact just creating toy versions and there is no way they can do it without learning the underlying architecture. But they just keep going wasting 100s of dollars , lost in a sea of bugs
Dabbled with OpenSCAD as we will. I decided to learn FreeCAD and what I discovered is that, even putting aside FreeCAD's many documented issues, parametric GUI CAD is not an imprecise, clumsy or fiddly way to work.
It is expressive, precise, generally capable of all the things that code-CAD can do and much more, and it's much, much quicker to work in, once you've learned a few core principles.
As you say, there is an underlying architecture; it's not just a sort of 3D paint package.
The problems the text-as-whatever crowd have are all Dunning-Kruger things in the truest sense.
People who are unaware they are unskilled in a particular technology are unlikely to successfully replace it with another. Particularly one that requires describing the problem domain in precise language.
Quite often when you see text-to-CAD discussions, especially here, there's evidence of profound misunderstandings from the people who think they are going to automate it. They assume their frustrations with the tools stem from limitations of the tools, not from the limits of their understanding.
As a person with decades of experience of code I have found learning how to use LLMs effectively to be much, much harder than learning CAD.
I've been programming since I was a kid. I enjoy it a lot, I like knowing how things work, I get excited about new compiler features, I stayed up every night for a week when I discovered Lean 4, etc etc etc.
At the same time I realized a few years ago that I just don't want to write any code ever. Or read any code. Coding is addictive and fun, but I'd rather talk to the computer and have things magically get done. (FWIW learning how to use LLMs feels more.. fulfilling, too)
Anyway. GLM 5.2 is nice and all, but I might have to spend half an hour guiding it to come up with a plan I'm happy with. And with Opus it could be 15 minutes. I'm still going to spend an hour talking to LLMs one way or the other, but with Opus it will be a less frustrating hour. If Fable gives me a frustration-free hour, I'll switch to Fable.
The reason is pretty simple and has to do with statistics: on long-horizon tasks, small errors and deviations from the "good path" compound.
That changes as soon as the developer is the one paying for a model. Then it's a classical engineering trade-off between money and quality, and that's where open models are clear winners.
The problem isn't what they do in a blank state. It is how they get there and the edge cases. Some models also take longer (uses more steps) i.e. end up costing more despite being "cheaper".
I've seen models:
- Back out plans non-stop. Tried the obvious path. Invents X/Y/Z excuse (without verifying) that it can't be done. Notes that down and moves on. It could be as simple as site A being down and to download from site B but that's it.
- Hacks the test to make it work. Code is wrong? Nah, let's update the test.
- Keep saying useless things like YAGNI and infinite excuses like too risky to never do the work.
- Claims they are done but there's 100 edge cases not covered. When you try to use it it fails in ways you as a human assume it should work. You can write a spec to cover it all but then what's the point?
- Be trigger happy and never investigate. Tries to do it. 5 minutes. Oh it failed. Back out. Repeat. Better models definitely spend more time analyzing and actually "think". I've had models spend hours trying to do a change due to this method when an actual investigation (code walkthrough) might have solved it.
- Know and use the right tools. A lot of lesser models have infinite fear e.g. oh docker might not be available (it is) or this and that (even if you nudge it in any way) and spend a lot of extra time "working around" it.
The list goes on. Better models definitely help.
Only thing to agree on is no you don't need Fable but saying Sonnet can do the job instead of Opus is a different story. It's so obvious when Sonnet touches the code that I can't give it more than 5 minutes. It lies. Doesn't check. Forgets things and then messes up.
That’s not super surprising to me, but, given the apparent randomness of the stack ranking, is GLM actually worse than any of the Anthropic models? This looks like a 10-way tie to me.
We now use Sonnet 4.6 for a number of internal use cases we wouldn't have considered otherwise.
4.7 was so bad, I locked a bunch of my machines to 4.6.
I haven’t bothered locking the 4.8 machines to 4.6. There was a HN thread a while back where they run swe bench a few times a day and measure success rate and latency. It showed opus getting significantly dumber for the week before a recent launch.
It wouldn’t surprise me if they’re quantizing to improve margins or to hype models in comparative testing in order to defraud investors at IPO.
Or, maybe QA is hard. Anyway, I think they hit a performance wall sometime at or before 4.6.
Just want to express how amazing that is. Opus 4.6 is an amazing model. That an open weight model like GLM 5.2 competes with it is nothing short of outstanding.
On the contrary, I personally think these broader benchmarks are meaningless. I think personalized benchmarks are the way to go. They should answer "How does this model perform for MY use-case?" rather than trying to answer "How does this model perform across all coding environments?"
Case in point: I use Elixir which is not as popular as Python, is always a hit or miss with most SOTA models at the top of these benchmarks. Whereas, the ones in the middle of the benchmarks (like the GLM) almost always outperform even SOTA models from Google / Anthropic. However, this is relevant only for my use case and I wouldn't just advocate a model for everyone based off my use-case alone.
We test 11 popular/interesting languages (you can see the Languages chart to filter), but not Elixir -- although other evaluations have found that many LLMs solve more problems when working with Elixir [0]. Why models write code well in some languages over others seems to go beyond pre-training data (Python scores quite low for most models) and we don't fully understand it.
[0] https://elixirforum.com/t/llm-coding-benchmark-by-language/7...
Sounds like this is the year for coding.
They did this to themselves.
So having a model of 4.6 quality is still extremely awesome. That currently is more of less the frontier reference outside the US :(
Congrats, now you’re paying an engineer’s salary to make your engineer at best 20% more productive.
Better to hire another engineer, or two jrs, and build up your in house talent.
Most of the gains right now come from tooling and process and any big post 2025 language model. The specific model isn’t that important right now.
It doesn't have a higher capability score than Fable, though. We break our coding evaluations into 2 parts, and "one-shot coding" makes up part of the index, where Fable significantly outperforms every other model, which is why it's ranked at the top despite Sonnet 4.6 having a slightly higher median (and lower average) in long-horizon agentic workloads. One-shot coding tends to be the most correlated with other companies' model cards, whereas agentic coding is partly about how well a model can adapt to a custom harness. Fable also refused some tasks.
Data at https://gertlabs.com/rankings?ow=1&mode=oneshot_coding
When I first used Opus 4.8, I threw several different workloads I had at it - I have Claude doing a lot of misc projects whose primary purpose is pretty much just studying what AI agents can do for my own curiosity and no other reason. Opus 4.8 was one of the first models I ever snuck in there that basically ran out of control. No previous Opus or Sonnet model I had used ever did this. Within hours every agent I had running was writing non-sense tool calls that echoed pretend commands that didn't exist, like 10 in a row, and talking about the "tool channel" being dirty. I switched back to Opus 4.7 and assumed Opus 4.8 was legitimately just broken.
I did come back to Opus 4.8 and found that it was indeed, pretty powerful. But that initial experience has stuck with me on just how narrow of a perspective any given test or benchmark is guaranteed to have. LLMs are too broad, it really doesn't matter what you try to do in your benchmark, you will necessarily get a limited view of what the model is capable of and its shortcomings. This will remain true for at least as long as models are susceptible to massive swings in performance based on randomness and minor differences in prompts and other environmental factors.
I'm not saying benchmarks are useless or that your benchmarks are not possibly closer to the truth either. All evidence at least points to the idea that Chinese models perform very well in coding but often have more mixed results on other tasks. I'm just saying that at this point, benchmarks feel like they have limited connection to my actual real experiences. GLM 5.2 actually scored kinda meh on a lot of benchmarks (compared to closed frontier models) but my actual experience using it does not match this.
And I'm definitely not saying GLM 5.2 is better than the frontier LLMs here, just that the race is close. I still prefer GPT 5.5 right now for code review, I think, and Opus clearly has some advantages depending on the task. It's just no longer a given that Opus 4.8 will perform better than GLM 5.2 on any given task, so to me the calculus behind "using the best model available" is getting complex and you might need to get a feel for what models have what strengths to really figure it out.
I do feel like the "use the best model available" mentality is not going to die any time soon, but if it does die, it will be gradual and start soon for programming. Modern LLMs are still not a full superset of what human programmers can do, but still larger models are definitely starting to hit diminishing returns for tasks at the lower end of complexity, and that is a big deal. It's a weird world where some tasks you can feel kinda confident just throwing Gemma 4 at it and not sweating whether you should use a better model; I've certainly done it for some quick Python scripts or getting an overview of some code I'm unfamiliar with.
Some blog post I read few weeks back said that DSV4Flash in xHigh effort beats even the pro model in xHigh effort.
I think the surprising thing is I expect flash to be a pure distillation and strictly worse quality but clearly it’s more nuanced than that.
https://www.reddit.com/r/DeepSeek/comments/1rd5jw7/claude_so...
Notice he's using "trust me bro" benchmarks.
Can we just remove all the motivated speech on HN? This is just not trustworthy information at all and obviously is incentivized.
Everyone is grinding and marketing nobody is actually discussing anything for real.