For some real data, Artificial Analysis reported that 4.6 (max) and 4.7 (max) used 160M tokens and 100M tokens to complete their benchmark suite, respectively:

https://artificialanalysis.ai/?intelligence-efficiency=intel...

Looking at their cost breakdown, while input cost rose by $800, output cost dropped by $1400. Granted, whether the output savings offset the input increase will be very use-case dependent, and I imagine the delta is a lot closer at lower effort levels.
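To make the trade-off concrete, here's the back-of-the-envelope arithmetic on those two deltas (only the $800 and $1400 figures come from the breakdown above; everything else is just the sum):

```python
# Net cost change across the benchmark suite, using the deltas above.
input_delta = 800     # input cost rose by $800
output_delta = -1400  # output cost dropped by $1400

net_delta = input_delta + output_delta
print(net_delta)  # -600: the suite got $600 cheaper overall
```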

reply
This is the right way of thinking end-to-end.

Tokenizer changes are one piece to understand for sure, but as you say, you need to evaluate $/task not $/token or #tokens/task alone.
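A minimal sketch of that $/task framing. All the rates and token counts here are hypothetical placeholders, not numbers from the thread; the point is only that a model with higher per-token rates can still win per task if it uses fewer tokens:

```python
def cost_per_task(input_tokens, output_tokens,
                  input_rate_per_m, output_rate_per_m, tasks):
    """Dollars per completed task, rather than per token."""
    total = ((input_tokens / 1e6) * input_rate_per_m
             + (output_tokens / 1e6) * output_rate_per_m)
    return total / tasks

# Hypothetical comparison: model B charges more per token but emits fewer tokens.
a = cost_per_task(50e6, 20e6, input_rate_per_m=3, output_rate_per_m=15, tasks=1000)
b = cost_per_task(40e6, 10e6, input_rate_per_m=4, output_rate_per_m=20, tasks=1000)
print(a, b)  # model B is cheaper per task despite pricier tokens
```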

reply
Why is it not useful? Input token pricing is the same for 4.7, but the same prompt now tokenizes into more tokens, so it costs roughly 30% more for input.
reply
The idea is that smarter models might use fewer turns to accomplish the same task, reducing the overall token usage.

Though, from my limited testing, the new model is far more token-hungry overall.

reply
Well, you'll need the same prompt for input tokens?
reply
Only the first one. Ideally now there is no second prompt.
reply
Are you aware that every tool call produces output which also counts as input to the LLM?
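And this compounds: in an agentic loop the whole conversation, tool results included, is typically re-sent as input on every subsequent turn. A toy sketch (all token counts hypothetical):

```python
# Toy model of an agentic loop: each turn re-sends the full history,
# so every tool result is billed as input on all later turns too.
system_prompt = 2000   # tokens in the initial prompt (hypothetical)
tool_result = 500      # tokens added per tool call (hypothetical)
turns = 10

history = system_prompt
total_input_billed = 0
for _ in range(turns):
    total_input_billed += history  # whole history sent as input this turn
    history += tool_result         # tool output appended for the next turn

print(total_input_billed)  # far more than turns * system_prompt
```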
reply
That's valid, but it's also worth knowing it's only one part of the puzzle. The submission title doesn't say "input".
reply
Yes. I actually noticed my token usage go down on 4.6 when I started switching every session to max effort. I got work done faster with fewer steps because thinking corrected itself before it cycled.

I’ve noticed 4.7 cycling a lot more on basic tasks. Though, it also seems a bit better at holding long running context.

reply
With AIs, it seems like there's never a comparison that's actually useful.
reply
You can build evals. Look at Harbor or Inspect. It’s just more work than most are interested in doing right now.
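For a flavor of what "building evals" means, a minimal hand-rolled harness. This is not Harbor's or Inspect's actual API, just the shape of the idea; `fake_model` is a stand-in callable, and real frameworks add datasets, solvers, and graders on top of this loop:

```python
def run_eval(model, cases):
    """Score a model callable against (prompt, expected) pairs; return pass rate."""
    passed = sum(1 for prompt, expected in cases if model(prompt) == expected)
    return passed / len(cases)

# Usage with a stand-in "model" (a plain lookup function here).
cases = [("2+2", "4"), ("capital of France", "Paris"), ("3*3", "9")]
fake_model = lambda p: {"2+2": "4", "capital of France": "Paris"}.get(p, "?")
print(run_eval(fake_model, cases))  # 2 of 3 cases pass
```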
reply
Yup, it's all vibes. And Anthropic is still winning on those in my book.
reply