Could it be that these firms are optimizing for two things: a) better performance, and b) gathering data from you to further improve performance later? I've also found the heavy emphasis on planning rather than iteration frustrating. It feels like I'm teaching a junior!
reply
I think they simply optimize for E2E benchmarks. None of those benchmarks is designed around multi-turn assistance to the user; they all go from a prompt straight to the final solution.
reply
Exactly. How can "we" develop and encourage benchmarks for multi-turn user assistance? That is what I want. I feel like the models and harnesses push much too hard against this workflow -- that they push you towards letting go and vibe coding, with only your discipline (and desire for a quality and maintainable product) holding it back.
reply
more thinking == more tokens == more money LOL
reply
Is there a cost benchmark out there? I wonder how frontier models are doing over time on cost per problem solved.
reply
I think they are optimizing for one-shot performance because that will drive usage. They can’t afford to look bad in the benchmarks. And if that means consuming an order of magnitude more tokens, well, that’s good for business, too.
reply