Unless we run our own benchmarks, we have to take all the marketing fluff from the frontier labs at face value, and all public benchmarks degrade eventually as labs optimize towards them. OP’s approach is wasteful because it is brute force, but the post says that an Elo rating is kept, so this is also an experiment, and I don’t see what’s wrong with that. You learn which model performs well in which settings, which may save resources later. It’s also wasteful to keep working with the wrong model/harness/tools for too long.
It is the other way round.

In an interactive session, adding "Fine, but make the button red" after the model generated a first solution more than doubles the tokens used, because the model now receives not only the original code and the feature request but also the updated code plus the change request as input tokens.

Sending a feature request to an LLM and then sending the feature request again with "The button shall be red" only doubles the tokens used.
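
To make the arithmetic concrete, here is a rough sketch that counts input tokens for the two workflows. The token counts are made-up numbers for illustration, and the sketch ignores prompt caching and output tokens entirely.

```python
def interactive_input_tokens(code: int, request: int, first_output: int, change: int) -> int:
    """Input tokens for two turns in one session (no caching assumed)."""
    first_turn = code + request
    # The second turn replays the whole history, including the model's first answer.
    second_turn = code + request + first_output + change
    return first_turn + second_turn


def resend_input_tokens(code: int, request: int, change: int) -> int:
    """Input tokens when the amended feature request is sent as a fresh request."""
    first_turn = code + request
    second_turn = code + request + change
    return first_turn + second_turn


# Hypothetical sizes: 10,000 tokens of code, a 200-token request,
# a 1,500-token first answer and a 10-token change request.
single = 10_000 + 200
interactive = interactive_input_tokens(10_000, 200, 1_500, 10)
resend = resend_input_tokens(10_000, 200, 10)
print(single, interactive, resend)
```

With these numbers the interactive session exceeds twice the single request by the size of the first answer plus the change request, while the resend exceeds it only by the change request.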

The cost is far from linear though, because of prompt caching and the fact that output tokens are generally a lot more expensive than input tokens.
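
A minimal sketch of what that looks like in numbers. The prices below are hypothetical placeholders, not any provider's real rates, and it assumes the provider bills cached input tokens at a discount.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prices:
    """Hypothetical prices in dollars per million tokens."""
    input_per_mtok: float
    cached_input_per_mtok: float
    output_per_mtok: float


def request_cost(prices: Prices, uncached_in: int, cached_in: int, out: int) -> float:
    """Cost of one request, split into uncached input, cached input and output."""
    total = (
        uncached_in * prices.input_per_mtok
        + cached_in * prices.cached_input_per_mtok
        + out * prices.output_per_mtok
    )
    return total / 1_000_000


PRICES = Prices(input_per_mtok=3.0, cached_input_per_mtok=0.3, output_per_mtok=15.0)

# Follow-up turn: 10,200 tokens of prefix, 1,510 new input tokens, 1,500 output tokens.
follow_up_cached = request_cost(PRICES, uncached_in=1_510, cached_in=10_200, out=1_500)
follow_up_uncached = request_cost(PRICES, uncached_in=11_710, cached_in=0, out=1_500)
print(follow_up_cached, follow_up_uncached)
```

Under these assumed prices the output tokens dominate the bill in both cases, and a cache hit on the prefix cuts the follow-up cost by roughly half.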
Agreed that it is not linear.

I wrote my own agent, and it sends data to LLMs in this order: "General Prompts (How to write good code)" + "The Code" + "The Feature Request". This means the KV cache will be used even when the feature request changes.
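
Roughly, the ordering can be sketched like this. It is not my agent's actual code: `GENERAL_PROMPT`, `build_prompt` and `shared_prefix_len` are hypothetical names, and the character-level prefix is only a stand-in for the token-level prefix a real KV cache matches on.

```python
import os

# Hypothetical stand-in for the "how to write good code" instructions.
GENERAL_PROMPT = "You write clear, well-tested code.\n"


def build_prompt(code: str, feature_request: str) -> str:
    """Stable parts first, the part that changes between requests last."""
    return GENERAL_PROMPT + code + "\n" + feature_request


def shared_prefix_len(a: str, b: str) -> int:
    """Length of the common leading part of two prompts (character-level proxy)."""
    return len(os.path.commonprefix([a, b]))


code = "def render_button():\n    return '<button>OK</button>'\n"
first = build_prompt(code, "Add a login button.")
second = build_prompt(code, "Make the header sticky.")

# Everything up to the feature request is identical, so it can be served from cache.
print(shared_prefix_len(first, second), len(first), len(second))
```

If the feature request came first instead, the two prompts would diverge almost immediately and nothing after that point could be reused.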

And there are usually far fewer output tokens than input tokens.

So I think that my approach is very lightweight on token usage compared to an interactive session.

It would be interesting to measure it for the other agents out there: sending a feature request two times vs. an interactive session.

"Make the button red" probably doesn't need an LLM at all.
One tends to use LLMs for everything in practice. It’s inconvenient to switch modes of operation.
That’s usually not true due to caching. It may be true if you leave a large gap in between, but if you send “make it red” right after, then the extra cost is purely incremental.
Probably like 1% of the energy an average person spends on driving.
Average American is what you mean.
The cost is nothing compared to the outcome and time savings. What I see is that people with no money want to jump into this pool but they aren't having a good time. That is generally the case when you are poor.
come on now, we can't just not escape the permanent underclass by using our brains, we've also got to use up all the resources while doing it.