Could it be that these firms are optimizing for two things: a) better performance, and b) gathering data from you to further improve performance later? I've also found the huge amount of planning, rather than iteration, frustrating. I've felt like I'm teaching a junior!

by epolanski 7 hours ago

I think they simply optimize for E2E benchmarks. None of those benchmarks is designed around multi-turn assistance to the user; they go from a prompt straight to the final solution.

by celrod 1 hour ago

Exactly. How can "we" develop and encourage benchmarks for multi-turn user assistance? That is what I want. I feel like the models and harnesses push much too hard against this workflow -- that they push you towards letting go and vibe coding, with only your discipline (and desire for a quality and maintainable product) holding it back.

by happyPersonR 5 hours ago

more thinking == more tokens === more money LOLL

by overfeed 3 hours ago

Is there a cost benchmark out there? I wonder how frontier models have trended over time in cost per problem solved.
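For anyone who wants to track this themselves, here is a minimal sketch of how a "cost per problem solved" number could be computed, assuming you log input and output token counts per attempt and know the per-million-token prices. The `Attempt` type, the function name, and the prices in the usage example are hypothetical, not any vendor's actual API or rates. It also assumes thinking tokens are billed as output tokens, which is how the "more thinking == more tokens == more money" point above would show up in the metric.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    """One benchmark attempt (hypothetical logging format)."""
    input_tokens: int
    output_tokens: int  # assumption: includes any "thinking" tokens billed as output
    solved: bool


def cost_per_problem_solved(attempts, usd_per_m_input, usd_per_m_output):
    """Total spend across ALL attempts divided by the number of problems solved.

    Failed attempts still cost money, so they count toward the numerator.
    Returns infinity if nothing was solved.
    """
    total_usd = sum(
        a.input_tokens * usd_per_m_input / 1_000_000
        + a.output_tokens * usd_per_m_output / 1_000_000
        for a in attempts
    )
    solved = sum(1 for a in attempts if a.solved)
    if solved == 0:
        return float("inf")
    return total_usd / solved


# Usage with made-up numbers: one cheap success, one token-heavy failure.
attempts = [Attempt(10_000, 2_000, True), Attempt(10_000, 20_000, False)]
print(cost_per_problem_solved(attempts, usd_per_m_input=3.0, usd_per_m_output=15.0))
```

Charging failed attempts to the solved problems is the design choice that matters here: a model that burns an order of magnitude more tokens to lift its pass rate slightly can still come out more expensive per solved problem.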

by drob518 3 hours ago

I think they are optimizing for one-shot performance because that will drive usage. They can’t afford to look bad in the benchmarks. And if that means consuming an order of magnitude more tokens, well, that’s good for business, too.
