by jfaat16 hours ago |

by dofm8 hours ago|

> I'm trying to wrap my head around exactly why so many people seem to want the best model available

This is the logical end point of the fear-based way LLMs are marketed. You must want the best, because everyone who has the best can work faster than you, generate more — if you don't have the best, you are behind! Why would you want to use anything other than the best?

The thing is, once everyone has the best, the question is: how much can you spend? If you can't spend more, you are behind! If spending the most will get you ahead, why would you not want to spend the most, if you can afford it?

There is only one way through this, in the long run: work out a way forward that doesn't make you dependent on this cycle. If you can compete at all, without the spend, what happens is: they burn money and you don't.

FWIW so far I don't think the benchmarks prove very much about the actual experience, and you can discover this just as easily without spending any money. And we know this about benchmarks! Once a benchmark seems useful as a measurement, it becomes a target and it stops being as useful.

I think your strategy is right. It requires bravery, and as you say, it requires ego balance. But I believe it is obvious that the world will either come around to a more sensible, stable pattern or it doesn't matter either way because we're fucked. So opting out of this mad early cycle and choosing to be calmer and happier is a choice you can just make.

by nl13 hours ago|

> most halfway decent models can write damn good code for a fraction of the price.

The difference is how the model is used.

With Opus you can give it a long-horizon task (eg build an entire feature) and it will plan it out and implement it and almost always stay on task. This is what people mean when they say "agentic tasks"

With the lesser models the code is fine, but they need something else to plan what needs to be done.

GLM-5.2 is the third model (after Opus 4.6+ and GPT-5.5) that can do this agentic style work.

Notably Gemini 3.1 Pro is notoriously bad at this style of work - the code is good, but it drifts off task most of the time. 3.5 Flash is supposed to address this, but I haven't had a good reason to try it.

by jfaat13 hours ago|

My whole point is that I don't want it to build an entire feature from one prompt. At most, I want to work with an agent to nail down the spec and then work with an agent that orchestrates the implementation via other agents, same for testing, etc. None of that requires frontier capabilities, it requires a little bit of work on a harness, a little bit more of my input, a little more of my brainpower. I _want_ to build tools that make it work better and don't change when the CC team gins up some default for their harness and foists it on me. I don't see that as a tradeoff at all and I think engaging in my work process more than fire and forget (and literally always in my experience fix stuff later) is more fun and rewarding once the 'holy shit this is now possible' high wears off. Doubly so once the frontier model gets nerfed mid-cycle and now I have to undo the mess because they released v*.x++ and I fell for it again by trusting it to do these agentic tasks without my involvement.

by theptip4 hours ago|

> My whole point is that I don't want it to build an entire feature from one prompt

You are free to do you. But you were asking about why others want the best model.

The answer is, clearly, agentic coding (ie multiple agents each cranking through tasks independently) lets you ship A LOT more business value if used correctly.

by pimeys12 hours ago|

Yep. I've tried to use the models to build large things for me. You can't trust the code it produces. Even if it works there are parts that are hot garbage, and will bite you later on. I've found out that having an editor open, asking it to implement things until a certain point, manually fixing some of the worst things it generates, then asking it to expand from there is much better than just prompting a thing and pushing to production.

And hey, don't get me wrong, you can get pretty far with just prompting. But the subtle misses and (I'm looking at you GPT) the overengineered 20k line PRs to do a simple thing are going to cost you a lot if you're not vigilant.

by nl6 hours ago|

> My whole point is that I don't want it to build an entire feature from one prompt. At most, I want to work with an agent to nail down the spec and then work with an agent that orchestrates the implementation via other agents, same for testing, etc. None of that requires frontier capabilities

I don't think anyone is stopping you. This is an entirely valid way of working.

I for one am glad to leave that behind me. The sooner I never have to write another line of code the better (professional software engineer for nearly 30 years here, for context).

by semi-extrinsic54 minutes ago|

I don't know about you guys, but half of the time I give Opus something actually complicated, it spends 50+ minutes trying to understand the problem, running lots of searches and tool calls, and then gives up and just writes a brief summary of what it thought about. Biggest waste of tokens you can imagine.

by seviu10 hours ago|

I would say 3.5 flash is great if you use a good open harness. I use omp for that. The thing with Google is that they announce they have a great model, and that they have been testing it internally for half a year. I guess they don't care too much about who uses it or how.

I am still struggling with how to deal with sub agents and different roles for each model. I still think Claude or Codex are overall better models, but everything around them gives off such weird vibes, including, and this one kills me, that at certain times they feel dumbed down.

I keep changing these things often, but I have a basic subscription to Codex ($20 plan) which I use with GLM 5.2 to do some high level planning of what I intend to do, and then let DeepSeek do the coding. Or something along those lines.

Point is, GLM 5.2 is now at a point where I cannot tell you if it's better or worse. I can tell you however one thing: no matter when I use it, it's consistent in what it does and how it works.

Then there is the Fable thing, but as with many things, I think the past has distorted the reality. It lasted two days, but Anthropic said clearly that for plan users it would only be there for two weeks. It was great for doing what you can already do with other tools: doing all the planning, and reviews, and launching a million subagents talking to each other. I sometimes wonder if it was really a new model, or just Opus 4.9 wrapped with some fancy model driven harness.

by nl9 hours ago|

Big fan of Amp but pretty sure it only uses Flash for search: https://ampcode.com/models

As for Fable: I used it as much as I could while we had it.

It was a step change over Opus with my work.

by swiftcoder11 hours ago|

> With Opus you can give it a long-horizon task (eg build an entire feature) and it will plan it out and implement it and almost always stay on task. This is what people mean when they say "agentic tasks"

I've had no trouble getting the current generation of smaller models to do the same thing. Maybe it's more of a harness issue than a model issue?

Recently I've used both MiniMax M3 and DeepSeek V4 Flash to one-shot moderately complex applications from a written spec, and neither one got lost along the way

by NitpickLawyer13 hours ago|

> 3.5 Flash is supposed to address this, but I haven't had a good reason to try it.

Price and speed, for me. GLM5.2 is "good enough" for some tasks, but rather slow (on their coding plan). In the time it takes GLM to "read files to figure out...", gemini flash is usually finished. It's not SotA for coding, but it's fast and often "good enough" for normal tasks.

by nl6 hours ago|

> Price and speed, for me.

For Flash 3.5?

I'm a big fan of Gemini 3.1 Flash Lite Preview (yes that is the name..).

I keep an agentic SQL benchmark up to date to test new models. It's more-or-less saturated above 23/25 but below that is still useful, and even at that level is good for comparing speed, cost and token efficiency.

3.1 Flash Lite Preview scores 22/25 in 142 seconds for $0.02. That's a great result if you care about cost for performance.

3.5 Flash scores 20/25 in 367 seconds for $0.76. The slow speed is because it takes a lot of tokens to generate its results, so even if tokens are produced quickly it takes too many to get a positive result.

There's nothing I've seen or heard that indicates 3.5 Flash is better than these results suggest.

https://sql-benchmark.nicklothian.com/?highlight=google_gemi.... vs https://sql-benchmark.nicklothian.com/?highlight=google_gemi... (click the cells to see the traces)
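One rough way to put the two results quoted above side by side is to normalize cost and wall-clock time by the number of correct answers. The sketch below only uses the figures from this comment (22/25 in 142 seconds for $0.02, and 20/25 in 367 seconds for $0.76); the `Run` class and its property names are hypothetical helpers for illustration, not part of the benchmark itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """One benchmark result (hypothetical container for the quoted numbers)."""
    name: str
    correct: int
    total: int
    seconds: float
    dollars: float

    @property
    def cost_per_correct(self) -> float:
        # Dollars spent per correctly answered question.
        return self.dollars / self.correct

    @property
    def seconds_per_correct(self) -> float:
        # Wall-clock seconds per correctly answered question.
        return self.seconds / self.correct


# Figures as quoted in the comment above.
runs = [
    Run("3.1 Flash Lite Preview", correct=22, total=25, seconds=142, dollars=0.02),
    Run("3.5 Flash", correct=20, total=25, seconds=367, dollars=0.76),
]

for run in runs:
    print(
        f"{run.name}: {run.correct}/{run.total}, "
        f"${run.cost_per_correct:.4f} and {run.seconds_per_correct:.1f}s per correct answer"
    )
```

On these numbers the per-correct-answer cost differs by a factor of roughly 40, which is the gap the comment is pointing at. A single run of a 25-question benchmark is a small sample, so the ratio is indicative at best.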

by andai15 hours ago|

Yeah, the funniest thing about everyone freaking out about Fable's capabilities recently was that for most of the stuff they were amazed by, you could get roughly the same result from DeepSeek Flash.

I used to be obsessed with what's the best model. Then a while back when the new best model came out, I tested it on a task. I also tested its little brother (much smaller model from same company).

They both completed the task perfectly except the "best" model (the bigger one) cost 5x more and took 3x longer...

by Ladioss11 hours ago|

"Best model" discourse always reminds me of my days in Monster Hunter with people who refused to consider playing with anything other than the meta set for their weapon and then proceeded to immediately cart right at the beginning of the hunt :)

With the wealth of models available (open source vs closed, api vs local), I find optimizing the cost-efficiency of your token consumption an important part of business-oriented AI engineering. You don't need "the best" for every task.

by cdud39 hours ago|

A lot of the monetization strategies for LLMs depend on the need to use them via SaaS subscriptions. If companies start to realize that local AI is cheaper, provides good enough results and makes them independent then that monetization strategy falls apart and a whole industry collapses.

by realusername14 hours ago|

> They both completed the task perfectly except the "best" model (the bigger one) cost 5x more and took 3x longer...

Same for me, I certainly don't have the same definition of success and failure either.

A more expensive model has *less* room for wandering around than a cheaper model.

If Claude wanders around for 10 minutes before finding the most obvious solution, then I count it as a failure.

by maherbeg4 hours ago|

I would say one thing I've enjoyed about the latest frontier models from US labs is that you just work at a higher level of abstraction. You can talk about the end goal and it'll just rip. You'll add scaffolding to constrain the patterns etc, but I do way less babysitting than I expected on 5.6 vs 5.4 vs Deepseek v4 Pro.

by peheje11 hours ago|

Reason people want the best: people want to believe their project is so advanced that they need the most clever LLM possible. To say otherwise is to admit that it's not really frontier or novel in any way. And people don't like that.

by BugsJustFindMe1 hours ago|

> I'm trying to wrap my head around exactly why so many people seem to want the best model available

To me this is a "more expectations mean more disappointment" situation.

Some people have higher expectations than others, and even the best model available is not good enough for what those people really want it to do once you start digging. In that light, the goal is not using the best model, but rather using the least insidiously deficient model.

Many people chase the edge because it's the least disappointing.

> when it has recently become clear that most halfway decent models can write damn good code for a fraction of the price.

The fatuousness of this statement pretty quickly becomes apparent if you spend more time looking at it, IMO, because the venn diagram of "damn good" and "not nearly good enough" strongly overlaps. Even the best model writing excellent lines of code still has noticeably deficient ability to decide which excellent lines of code to write. The goal is to improve the separation between them, not save a few dollars, because the emotional effort is worth more to us than the money.

> And the frontier models get nerfed constantly so with open weights you can get something slightly less performant but way more stable.

Your minimization of performance differences and maximization of stability differences is exposing your biases.

Side note: I think you should know that to me at least some of what you said reads like self-rationalized moralizing. I couldn't help but imagine Principal Skinner saying "Am I so out of touch? No, it's the children who are wrong." People don't only want different things than you do because they don't know what they're doing.

by YmiYugy12 hours ago|

I’m writing a lot of React code and find that the cheaper models are pretty terrible. Maybe I’m holding it wrong, but the claim that the cheaper model is usually enough just doesn't track with my experience. Worse, I find predicting the difficulty of tasks exceedingly difficult. More often than not, using the initially cheaper models requires me to reroll with a more expensive one or waste a lot of time and tokens cleaning up the subpar results. With OpenAI and Anthropic still subsidizing tokens, not using the best models still seems like a tough ask.

by blobbers2 hours ago|

What happens when you find the models are terrible? The claimed results don't match? My dev cycle tends to be write a test for blah blah, add feature to satisfy test, make sure tests pass.

by ragebol4 hours ago|

I'm using DeepSeek v4 Flash through OpenCode and OpenRouter, and it works just fine. It's not the bottleneck, I am, for what I'm building. That involves understanding the problem I'm solving and checking correctness.

Meanwhile, it's such a cheap model that I've spent not even $25 over 3 weeks.

by treebrained6 hours ago|

For math, even the frontier has shortcomings, and there is a steep drop from GPT 5.5 xhigh to anything else. The time wasted by less-than-SotA just isn't worth it.

by grosswait8 hours ago|

Because not every problem is a coding problem or not entirely solvable through code. Other tasks include legal, philosophical, financial, investigative, and combinations of these and others.

by cicko7 hours ago|

It doesn't look like that's where the conversation was going, though.

by cik14 hours ago|

I've landed in a similar place by reducing effort and cutting up tasks. I find that more exacting specifications to the models yield significantly less need for "effort". Combining each with multiple git worktrees and an integration branch for the current worktrees themselves has yielded incredible results.

This also allows me to play with, and mix codex, claude cli, and others. This is my happy spot for the last two months.

by jfaat13 hours ago|

Yeah this sounds close to my workflow and it's good to hear you've found a nice flow too! It frees me up to spend that effort on doing more things in parallel and focusing way more on the specs which is usually a good idea anyway.

by andix6 hours ago|

I don't drive the best car available on the market. I don't own the fastest and best PC/Laptop/Smartphones available. I don't live in the best house in my city. I made reasonable choices that balance my needs and my available budget.

by Anonyneko6 hours ago|

>why so many people seem to want the best model available

In my case, I rarely ever go over the Claude/ChatGPT subscription limits, so might as well use those considered-best models. If I had to generate millions of lines of code, maybe I would've used the open models more.

by ifwinterco12 hours ago|

I think people are grouping into two flows.

One group is trying to get the LLM to basically one shot everything and not properly reviewing the output.

Others are using the LLM to assist their human intelligence in a tight loop.

If you’re doing the former you really do need the best model available because that’s still right on the edge of what LLMs can do at best, and at worst you’re just shipping pure unmaintainable slop.

If you’re doing the latter then you can get away with a slightly less powerful model without it making a material difference because your human intelligence is filling in gaps

by Foobar85687 hours ago|

The latter takes too many mental resources, and the same goes for truly reviewing the code generated by the former.

I generally started by reviewing but after a while (a few hours at most), I just can't keep up and resort to LLMs as sole reviewers.

by sourcecodeplz2 hours ago|

not many want to admit this

by marcyb5st12 hours ago|

Well put. I belong to the latter group as I feed small, granular tasks that I describe thoroughly to the LLM. I tried, however, to just give it a bigger scope task. Even the best models produce sloppy code.

While the single functions/classes/structs/... can be well thought out, the code tends to lack cohesion, and especially maintainability. For instance, it never thinks: "I could put this logic in an interface/trait so that if the requirements change I can simply add a concrete implementation that satisfies the new requirements (and potentially use one of these for testing)".
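As a minimal sketch of the kind of seam being described, here is the interface idea in Python using `typing.Protocol`. Every name here (`PriceRule`, `NoDiscount`, `PercentOff`, `checkout_total`) is made up for illustration and does not come from any real codebase; the point is only that the calling code depends on the interface, so a changed requirement becomes a new implementation rather than an edit to the caller.

```python
from typing import Protocol


class PriceRule(Protocol):
    """Hypothetical interface: anything with a matching apply() satisfies it."""

    def apply(self, amount_cents: int) -> int:
        ...


class NoDiscount:
    """Trivial implementation, also handy as a stand-in in tests."""

    def apply(self, amount_cents: int) -> int:
        return amount_cents


class PercentOff:
    """Implementation added later when the requirements change."""

    def __init__(self, percent: int) -> None:
        self.percent = percent

    def apply(self, amount_cents: int) -> int:
        # Integer arithmetic in cents avoids float rounding issues.
        return amount_cents * (100 - self.percent) // 100


def checkout_total(items_cents: list[int], rule: PriceRule) -> int:
    # The caller knows only the interface, not the concrete rule.
    return rule.apply(sum(items_cents))


print(checkout_total([1000, 500], NoDiscount()))
print(checkout_total([1000, 500], PercentOff(10)))
```

Swapping `NoDiscount()` for `PercentOff(10)` changes behavior without touching `checkout_total`, which is the maintainability property the comment says generated code tends to miss.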

by ifwinterco10 hours ago|

Yes that's also my experience.

SoTA models can do reasonably good jobs on each ticket, but over time the architecture of the application starts degrading without a human in the loop.

The entropy increases slower with better models but the trend is always towards slop

by dsrtslnd2311 hours ago|

I agree, but there are use cases for the 'best model' other than converting your 1975 stuff to rust: for use cases where LLMs are just starting to be useful I really want to use the current 'best' model: e.g. CAD, PCB design etc. In particular anything which requires spatial reasoning. The short time I had access to Fable 5 - it was just way better than any other model.

by dofm7 hours ago|

Except that there is no application for AI in CAD that is better, more appropriate, more robust or more sensible than learning how to use a CAD package and doing it yourself.

It's not fast-changing, it's not abstract, it's just not that difficult, and where it is difficult, the AI cannot help you, because it is not capable of things you are capable of.

Learn CAD yourself. Honestly; I was sure I would never manage to learn CAD but it turns out to be interesting, rewarding, valuable and actually quite quick to learn.

An LLM certainly is not going to be able to do it better than you once you have a tiny bit of experience. (PCB design, perhaps, has a language to it that an LLM can make a bit more headway into, but as a non-PCB-designer I would still bet that it's more like CAD than code)

by timacles5 hours ago|

This is a refreshing perspective because recently I feel like I’m surrounded by people who think they can effectively implement complex software, just by hammering the best models.

It has been hard to explain that they are in fact just creating toy versions and there is no way they can do it without learning the underlying architecture. But they just keep going, wasting 100s of dollars, lost in a sea of bugs.

by dofm4 hours ago|

Until a few years ago I'd have been the person who thought you could make a text-to-CAD system scale up to all of it. And then I tried to make stuff I wanted.

Dabbled with OpenSCAD as we will. I decided to learn FreeCAD and what I discovered is that, even putting aside FreeCAD's many documented issues, parametric GUI CAD is not an imprecise, clumsy or fiddly way to work.

It is expressive, precise, generally capable of all the things that code-CAD can do and much more, and it's much, much quicker to work in, once you've learned a few core principles.

As you say, there is an underlying architecture; it's not just a sort of 3D paint package.

The problems the text-as-whatever crowd have are all Dunning-Kruger things in the truest sense.

People who are unaware they are unskilled in a particular technology are unlikely to successfully replace it with another. Particularly one that requires describing the problem domain in precise language.

Quite often when you see text-to-CAD discussions, especially here, there's evidence of profound misunderstandings from the people who think they are going to automate it. They assume their frustrations with the tools stem from limitations of the tools, not from the limits of their understanding.

As a person with decades of experience of code I have found learning how to use LLMs effectively to be much, much harder than learning CAD.

by mschuetz6 hours ago|

For me, the 20€/month subscriptions were always sufficient, and it's nice if that subscription gives the latest and greatest results.

by darkstar_1611 hours ago|

It's also geeks and engineers using these models and being the most vocal. We always think we're special and need the extra horsepower. Ever been on one of those home lab subreddits? Same story.

by neongreen6 hours ago|

> I'm trying to wrap my head around exactly why so many people seem to want the best model available

I've been programming since I was a kid. I enjoy it a lot, I like knowing how things work, I get excited about new compiler features, I stayed up every night for a week when I discovered Lean 4, etc etc etc.

At the same time I realized a few years ago that I just don't want to write any code ever. Or read any code. Coding is addictive and fun, but I'd rather talk to the computer and have things magically get done. (FWIW learning how to use LLMs feels more.. fulfilling, too)

Anyway. GLM 5.2 is nice and all, but I might have to spend half an hour guiding it to come up with a plan I'm happy with. And with Opus it could be 15 minutes. I'm still going to spend an hour talking to LLMs one way or the other, but with Opus it will be a less frustrating hour. If Fable gives me a frustration-free hour, I'll switch to Fable.

by enraged_camel4 hours ago|

>> I'm trying to wrap my head around exactly why so many people seem to want the best model available when it has recently become clear that most halfway decent models can write damn good code for a fraction of the price.

The reason is pretty simple and has to do with statistics: on long-horizon tasks, small errors and deviations from the "good path" compound.
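To make the compounding concrete, here is a back-of-the-envelope sketch. It assumes, as a simplification, that each step of a task succeeds independently with the same probability, so the chance of finishing an n-step task without a single deviation is p to the power n. The per-step rates (0.99 and 0.95) and step counts are illustrative assumptions, not measurements of any model, and the function name is hypothetical.

```python
def task_success_probability(per_step_success: float, steps: int) -> float:
    """Chance that every one of `steps` independent steps succeeds."""
    if not 0.0 <= per_step_success <= 1.0:
        raise ValueError("per_step_success must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step_success ** steps


# Illustrative per-step success rates for two imagined models.
for p in (0.99, 0.95):
    for n in (10, 50, 200):
        chance = task_success_probability(p, n)
        print(f"per-step {p:.2f}, {n:>3} steps -> {chance:.1%} chance of a clean run")
```

Under these assumptions a four-point gap per step (0.99 vs 0.95) turns into roughly 60% vs 8% over 50 steps. Real agents can notice and recover from mistakes, so steps are not truly independent, but the direction of the effect is the same.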

by miroljub10 hours ago|

Of course people want the best model available, even at 10x costs, if they are not paying for it. If the company is paying, why wouldn't you want a 2% better model?

That changes as soon as the developer is the one paying for a model. Then it's a classical engineering trade-off between money and quality, and that's where open models are clear winners.

by ssk4216 hours ago|

What is your favorite harness for the open weights?

by jfaat15 hours ago|

We built our own and aren't done open sourcing it but before that I got to a really good place with opencode plus some custom agents, pi family is good too although I haven't used it as much. We made an agent to design a spec, one to implement by dispatching subagents, one to validate against the plan, things like that. All of this helps claude/gpt too IME. For open models it has helped them stay out of loops (e.g. Kimi's but WAIT) and for frontier it helps them stay on task and not invent bloated patterns

by SeriousM12 hours ago|

pi is great for learning, oh-my-pi has all the nice things included that I've built for pi previously.

by NamlchakKhandro15 hours ago|

pi-mono

by ithkuil12 hours ago|

What is pi-mono? (I heard about pi)

by re-thc9 hours ago|

> most halfway decent models can write damn good code for a fraction of the price

The problem isn't what they do in a blank state. It is how they get there and the edge cases. Some models also take longer (use more steps), i.e. end up costing more despite being "cheaper".

I've seen models:

- Back out plans non-stop. Tried the obvious path. Invents X/Y/Z excuse (without verifying) that it can't be done. Notes that down and moves on. It could be as simple as site A being down and to download from site B but that's it.

- Hacks the test to make it work. Code is wrong? Nah, let's update the test.

- Keep saying useless things like YAGNI and infinite excuses like too risky to never do the work.

- Claims they are done but there's 100 edge cases not covered. When you try to use it it fails in ways you as a human assume it should work. You can write a spec to cover it all but then what's the point?

- Be trigger happy and never investigate. Tries to do it. 5 minutes. Oh it failed. Back out. Repeat. Better models definitely spend more time analyzing and actually "think". I've had models spend hours trying to do a change due to this method when an actual investigation (code walkthrough) might have solved it.

- Know and use the right tools. A lot of lesser models have infinite fear e.g. oh docker might not be available (it is) or this and that (even if you nudge it in any way) and spend a lot of extra time "working around" it.

The list goes on. Better models definitely help.

Only thing to agree on is no you don't need Fable but saying Sonnet can do the job instead of Opus is a different story. It's so obvious when Sonnet touches the code that I can't give it more than 5 minutes. It lies. Doesn't check. Forgets things and then messes up.

by secrooq7 hours ago|

[flagged]