
>Putting that aside, I spend all day every day implementing very, very hard things right on the edge of what agents are (barely, sometimes) capable of

Is a single thing in your post demonstrable, or are we just supposed to take your word for it? Because all of this stuff sounds laughably subjective.

by smokel 7 hours ago

Most interesting things in software engineering are (laughably) subjective.

Just check out any conversation on dynamic vs static typing, talk to a Rust zealot, or ask a backend engineer if microservices were a mistake.

It's unfortunate, and it makes it hard to have proper discussions on these subjects. It would be worthwhile to figure out how we can have more constructive arguments.

by abirch 7 hours ago

"Have you ever noticed that anybody driving slower than you is an idiot, and anyone going faster than you is a maniac?" -- George Carlin

by tekacs 6 hours ago

Thanks very much for saying this!

Frankly, it feels like we should just sidestep arguments entirely and just all contribute our messy data/reports, and then see how we can meld all of it together, to find the best answers for our individual situations.

Probably a good use of frontier AI, melding all of that!

by tekacs 6 hours ago

It's all closed code, so I don't have a great way of showing you, but this is all pretty easy to test for yourself, and a good chunk of it is fairly objective:

On performance: just grab CC + Codex and try Opus 4.8 xhigh and GPT 5.5 xhigh side by side. Ask them a trivial question about something that's already in their context. Opus will churn for 30 seconds, and GPT 5.5 will respond in about three seconds. If you try the same with Fable 5 you'll notice way better adaptive thinking than Opus (it'll be quicker than Opus, even on xhigh – although often still slower than 5.5).

I have many, many times done 'Opus xhigh, Opus max and GPT xhigh all tried to implement something' – Opus max is... hours and hours. Opus xhigh is usually ~1.5-2x GPT 5.5 xhigh. This feels like a pretty straightforward generalization of the first point. Again, just try racing three agents and see what you get.
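A minimal sketch of that kind of race, assuming each agent can be launched as a non-interactive command that exits when it is done. The `race` and `time_command` helpers are hypothetical names for illustration, and the demo commands are placeholders rather than real CC or Codex invocations; swap in whatever you actually run.

```python
import subprocess
import sys
import time
from concurrent.futures import ThreadPoolExecutor


def time_command(name, argv, timeout=None):
    """Run one command to completion and return (name, seconds, exit_code)."""
    start = time.monotonic()
    try:
        proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
        code = proc.returncode
    except subprocess.TimeoutExpired:
        code = None  # None marks a run that hit the timeout
    return name, time.monotonic() - start, code


def race(commands, timeout=None):
    """Run all commands concurrently and return results sorted fastest first."""
    with ThreadPoolExecutor(max_workers=len(commands)) as pool:
        futures = [
            pool.submit(time_command, name, argv, timeout)
            for name, argv in commands.items()
        ]
        results = [f.result() for f in futures]
    return sorted(results, key=lambda r: r[1])


if __name__ == "__main__":
    # Placeholder commands (assumption): replace with your own agent invocations.
    demo = {
        "agent-a": [sys.executable, "-c", "import time; time.sleep(0.2)"],
        "agent-b": [sys.executable, "-c", "import time; time.sleep(0.05)"],
    }
    for name, seconds, code in race(demo):
        print(f"{name}: {seconds:.2f}s (exit {code})")
```

Wall-clock time only tells you who finished first, not whether the result is any good, so you still have to review each implementation by hand.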

As far as 'right on the edge of what they're able to do', my specific tasks don't matter. Just find something that no matter how hard you try, with however many agents or combinations thereof, with arbitrarily detailed plans, agents can't seem to implement without massive mistakes or a hollowing-out of 'the point' of the implementation... and then try it on the 'following generation' of models. I've been doing this repeatedly with coding agents since I turned aider into a CC-like coding agent in early 2025 (this was my second one, my first modern-style coding agent was in Jan 2025): https://github.com/Aider-AI/aider/pull/3781

A couple of examples of the latter thing that I tend to work on are database internals (indexes, query planner stuff, etc.; I built the DB in full before agents, it just works on it with me), very advanced UIs (try making a beautiful Rolex-like interactive visualization of the internals of a mechanical watch with Opus and see how far it gets – not very), and 'hardcore product questions' (all agents kinda suck at schema – Fable far less than prior ones). I have dozens and dozens of these that they can't do, though.

by andai 5 hours ago

>it tends to leave big, dangerous holes hiding inside implementations unless babied.

A brainwave: perhaps GLM or DeepSeek could be integrated into the mix for the purposes of red-teaming the code. Fable has been blinded to security by design[0], and the open models are pretty decent at it.

[0] It's not clear what the situation with GPT-5.6 will be but the blog suggests similarly over-cautious safety filters.

Amusingly the posts for recent Opus releases brag that they successfully made it worse at security! "during its [Opus 4.7] training we experimented with efforts to differentially reduce these ["cyber"] capabilities"

by tekacs 3 hours ago

I definitely use GPT-5.5 as a counterpart to validate these exact sorts of things in Anthropic models' implementations, in the (now-rarer) cases where I allow Anthropic's models _to_ implement.

And yeah, it's a bit depressing to think that 5.6 might be similarly nerfed. Less secure software for us all, I guess... except BigCorps. :(

by mklarmann 11 hours ago

It’s Gartner. Top-right is where you want to be.

by 0123456789ABCDE 10 hours ago

gartner magic quadrant charts don't break the natural expectation of left-to-right and bottom-to-top increasing values; the charts from the cursor post do.

by arcanemachiner 10 hours ago

Sounds like you're in the Trough of Disillusionment.

by pbowyer 12 hours ago

> I'm only using Opus at all because GPT-5.5 in the subscriptions only has a small (400k, but 258k effective) context window.

Do you find that makes a difference in your work? I've been using 5.5 high/xhigh to optimize and benchmark a C codebase, and just reading the initial code virtually fills the first context window. A session will auto-compact 5-15 times, but it seems to do okay in spite of that because the task is mainly focused on the latest window each time.

I think for programming the strength of GPT over Opus is winning here over the context window.

by tekacs 8 hours ago

> I think for programming the strength of GPT over Opus is winning here over the context window.

On this, absolutely!

I more often use Opus for planning than for implementation. In those cases I really do need the very large context window, because the agent has to read in a bunch of my code base and a bunch of previous plan files and product context and such, to understand what we're talking about.

And then I need to go back and forth with it over a really extended period: getting into a bunch of details, asking it to load how things already work so that we can discuss options for evolution of those, etc.

For that kind of thing, compaction completely destroys its effectiveness because even if you try to serialize out all the decisions made in the conversation into a plan file, the agent still loses e.g. the plan files and code files that it's read in that are adding sharp edges to its understanding of the scope of what's being planned.

For implementation or something like what you're describing in the vein of benchmarking, often I can get away with compaction. Although even then, if the agent needs to have a lot "loaded" into its head, to implement something very, very subtle, complex or far-reaching, in those cases it can be really detrimental if it compacts.

by rc13 hours ago

> I'm pretty baffled by their choice of axes

To put their own model out in front?

by daft_pink 5 hours ago

I agree; reversing the x axis makes this graph very hard to understand for the casual observer.

by cherryteastain 11 hours ago

You can set GPT 5.5 to 1M context mode in Cursor but it costs more after the default 272k.

by tekacs 10 hours ago

Yeah I've done this, it's just unaffordably/impractically expensive compared to the official subscriptions :/

by 0123456789ABCDE 11 hours ago

opus@max is on average worse than opus@xhigh

for supporting evidence, see first chart here: https://www.anthropic.com/news/claude-fable-5-mythos-5