My whole point is that I don't want it to build an entire feature from one prompt. At most, I want to work with an agent to nail down the spec and then work with an agent that orchestrates the implementation via other agents, same for testing, etc. None of that requires frontier capabilities, it requires a little bit of work on a harness, a little bit more of my input, a little more of my brainpower. I _want_ to build tools that make it work better and that don't change when the CC team gins up some default for their harness and foists it on me. I don't see that as a tradeoff at all, and I think engaging with my work process, rather than fire-and-forget (followed, literally always in my experience, by fixing stuff later), is more fun and rewarding once the 'holy shit this is now possible' high wears off. Doubly so once the frontier model gets nerfed mid-cycle and I have to undo the mess because they released v*.x++ and I fell for it again by trusting it to do these agentic tasks without my involvement.
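A minimal sketch of the kind of harness described above, not anyone's actual tooling. Everything here is an assumption made for illustration: the `Agent` type, the `Orchestrator` class, the one-task-per-spec-line rule, and the stub agents are all hypothetical, and a real setup would replace the stubs with model calls. The point it shows is the shape: a spec step, a human approval checkpoint, then an orchestrator fanning tasks out to a worker and a tester.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical: an "agent" is any callable from a prompt string to a reply string.
Agent = Callable[[str], str]


@dataclass
class Orchestrator:
    spec_agent: Agent   # turns a request into a spec (one task per line, by assumption)
    worker: Agent       # implements a single task
    tester: Agent       # judges a single piece of output
    log: list[str] = field(default_factory=list)

    def run(self, request: str, approve: Callable[[str], bool]) -> list[str]:
        spec = self.spec_agent(request)
        # Human checkpoint: nothing gets implemented until the spec is approved.
        if not approve(spec):
            self.log.append("spec rejected")
            return []
        results = []
        for line in spec.splitlines():
            task = line.strip()
            if not task:
                continue  # skip blank spec lines
            output = self.worker(task)
            verdict = self.tester(output)
            self.log.append(f"{task}: {verdict}")
            results.append(output)
        return results


# Stub agents standing in for real model calls (hypothetical).
def stub_spec(request: str) -> str:
    return f"parse {request}\nrender {request}"


def stub_worker(task: str) -> str:
    return f"code for {task}"


def stub_tester(output: str) -> str:
    return "pass" if output.startswith("code") else "fail"
```

Because each role is just a callable, swapping the model behind one role does not touch the others, and the approval callback is where the extra human input goes.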

by theptip 4 hours ago

> My whole point is that I don't want it to build an entire feature from one prompt

You are free to do you. But you were asking about why others want the best model.

The answer is, clearly, that agentic coding (i.e., multiple agents each cranking through tasks independently) lets you ship A LOT more business value if used correctly.

by pimeys 12 hours ago

Yep. I've tried to use the models to build large things for me. You can't trust the code they produce. Even if it works, there are parts that are hot garbage and will bite you later on. I've found that having an editor open, asking the model to implement things up to a certain point, manually fixing some of the worst things it generates, and then asking it to expand from there is much better than just prompting a thing and pushing to production.

And hey, don't get me wrong, you can get pretty far with just prompting. But the subtle misses and (I'm looking at you, GPT) the overengineered 20k-line PRs to do a simple thing are going to cost you a lot if you're not vigilant.

by nl 6 hours ago

> My whole point is that I don't want it to build an entire feature from one prompt. At most, I want to work with an agent to nail down the spec and then work with an agent that orchestrates the implementation via other agents, same for testing, etc. None of that requires frontier capabilities

I don't think anyone is stopping you. This is an entirely valid way of working.

I for one am glad to leave that behind me. The sooner I never have to write another line of code the better (professional software engineer for nearly 30 years here, for context).

by semi-extrinsic 53 minutes ago

I don't know about you guys, but half of the time I give Opus something actually complicated, it spends 50+ minutes trying to understand the problem, running lots of searches and tool calls, and then gives up and just writes a brief summary of what it thought about. Biggest waste of tokens you can imagine.

by seviu 10 hours ago

I would say 3.5 Flash is great if you use a good open harness. I use omp for that. The thing with Google is that they announce they have a great model and that they have been testing it internally for half a year. I guess they don't care too much about who uses it or how.

I am still struggling with how to handle subagents and assign a different role to each model. I still think Claude and Codex are better models overall, but everything around them gives off such weird vibes, including, and this one kills me, that at certain times they feel dumbed down.

I keep changing these things often, but I have a basic Codex subscription (the $20 plan), which I use with GLM 5.2 to do some high-level planning of what I intend to do, and then I let DeepSeek do the coding. Or something along those lines.

Point is, GLM 5.2 is now at a point where I cannot tell you if it's better or worse. I can tell you one thing, however: no matter when I use it, it's consistent in what it does and how it works.

Then there is the Fable thing, but as with many things, I think the past has distorted the reality. It lasted two days, but Anthropic said clearly that for plan users it would only be there for two weeks. It was great for doing what you can already do with other tools: doing all the planning and reviews, and launching a million subagents talking to each other. I sometimes wonder if it was really a new model, or just Opus 4.9 wrapped in some fancy model-driven harness.

by nl 9 hours ago

Big fan of Amp but pretty sure it only uses Flash for search: https://ampcode.com/models

As for Fable: I used it as much as I could while we had it.

It was a step change over Opus with my work.

by swiftcoder 11 hours ago

> With Opus you can give it a long-horizon task (eg build an entire feature) and it will plan it out and implement it and almost always stay on task. This is what people mean when they say "agentic tasks"

I've had no trouble getting the current generation of smaller models to do the same thing. Maybe it's more of a harness issue than a model issue?

Recently I've used both MiniMax M3 and DeepSeek V4 Flash to one-shot moderately complex applications from a written spec, and neither one got lost along the way.

by NitpickLawyer 13 hours ago

> 3.5 Flash is supposed to address this, but I haven't had a good reason to try it.

Price and speed, for me. GLM 5.2 is "good enough" for some tasks, but rather slow (on their coding plan). In the time it takes GLM to "read files to figure out...", Gemini Flash is usually finished. It's not SotA for coding, but it's fast and often "good enough" for normal tasks.

by nl 6 hours ago

> Price and speed, for me.

For Flash 3.5?

I'm a big fan of Gemini 3.1 Flash Lite Preview (yes, that is the name...).

I keep an agentic SQL benchmark up to date to test new models. It's more or less saturated above 23/25, but below that it is still useful, and even at that level it is good for comparing speed, cost and token efficiency.

3.1 Flash Lite Preview scores 22/25 in 142 seconds for $0.02. That's a great result if you care about cost for performance.

3.5 Flash scores 20/25 in 367 seconds for $0.76. The slow speed is because it takes a lot of tokens to generate its results, so even if tokens are produced quickly, it needs too many of them to get a positive result.

I've seen or heard nothing to suggest that 3.5 Flash is better than this result indicates.

https://sql-benchmark.nicklothian.com/?highlight=google_gemi.... vs https://sql-benchmark.nicklothian.com/?highlight=google_gemi... (click the cells to see the traces)
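To make the cost-for-performance comparison concrete, here is a small sketch that normalises the two results quoted above by the number of correct answers. The scores, times and costs are the ones stated in the comment; the `Run` class and the per-correct-answer metrics are my own framing, not something the benchmark itself reports.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    model: str
    score: int        # correct answers out of 25
    seconds: float    # wall-clock time for the whole benchmark
    cost_usd: float   # total cost for the whole benchmark

    @property
    def cost_per_correct(self) -> float:
        return self.cost_usd / self.score

    @property
    def seconds_per_correct(self) -> float:
        return self.seconds / self.score


# Figures as quoted in the comment above.
RUNS = [
    Run("3.1 Flash Lite Preview", score=22, seconds=142, cost_usd=0.02),
    Run("3.5 Flash", score=20, seconds=367, cost_usd=0.76),
]


def summarize(runs: list[Run]) -> list[str]:
    """Return one formatted line per run with the normalised metrics."""
    return [
        f"{r.model}: ${r.cost_per_correct:.4f} and "
        f"{r.seconds_per_correct:.1f}s per correct answer"
        for r in runs
    ]


if __name__ == "__main__":
    for line in summarize(RUNS):
        print(line)
```

On these numbers, 3.5 Flash costs roughly 42 times as much per correct answer and takes nearly three times as long per correct answer.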
