For some real data, Artificial Analysis reported that 4.6 (max) and 4.7 (max) used 160M tokens and 100M tokens to complete their benchmark suite, respectively:

https://artificialanalysis.ai/?intelligence-efficiency=intel...

Looking at their cost breakdown, while input cost rose by $800, output cost dropped by $1400. Granted, whether the output savings offset the input increase will be very use-case dependent, and I imagine the delta is a lot closer at lower effort levels.
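To make the trade-off concrete, here's the back-of-the-envelope arithmetic on those two deltas (only the $800 and $1400 figures come from the breakdown above; everything else is just the sum):

```python
# Net cost change across the benchmark suite, using the deltas above.
input_delta = 800     # input cost rose by $800
output_delta = -1400  # output cost dropped by $1400

net_delta = input_delta + output_delta
print(net_delta)  # -600: the suite got $600 cheaper overall
```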

reply
This is the right way of thinking end-to-end.

Tokenizer changes are one piece to understand for sure, but as you say, you need to evaluate $/task not $/token or #tokens/task alone.
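A minimal sketch of that $/task framing. All the rates and token counts here are hypothetical placeholders, not numbers from the thread; the point is only that a model with higher per-token rates can still win per task if it uses fewer tokens:

```python
def cost_per_task(input_tokens, output_tokens,
                  input_rate_per_m, output_rate_per_m, tasks):
    """Dollars per completed task, rather than per token."""
    total = ((input_tokens / 1e6) * input_rate_per_m
             + (output_tokens / 1e6) * output_rate_per_m)
    return total / tasks

# Hypothetical comparison: model B charges more per token but emits fewer tokens.
a = cost_per_task(50e6, 20e6, input_rate_per_m=3, output_rate_per_m=15, tasks=1000)
b = cost_per_task(40e6, 10e6, input_rate_per_m=4, output_rate_per_m=20, tasks=1000)
print(a, b)  # model B is cheaper per task despite pricier tokens
```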

reply
Why is it not useful? Input token pricing is the same for 4.7, but the same prompt now tokenizes into more tokens, so it costs roughly 30% more for input.
reply
The idea is that smarter models might use fewer turns to accomplish the same task, reducing the overall token usage.

Though, from my limited testing, the new model is far more token-hungry overall.

reply
Well, you'll need the same prompt for input tokens?
reply
Only the first one. Ideally now there is no second prompt.
reply
Are you aware that every tool call produces output which also counts as input to the LLM?
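And this compounds: in an agentic loop the whole conversation, tool results included, is typically re-sent as input on every subsequent turn. A toy sketch (all token counts hypothetical):

```python
# Toy model of an agentic loop: each turn re-sends the full history,
# so every tool result is billed as input on all later turns too.
system_prompt = 2000   # tokens in the initial prompt (hypothetical)
tool_result = 500      # tokens added per tool call (hypothetical)
turns = 10

history = system_prompt
total_input_billed = 0
for _ in range(turns):
    total_input_billed += history  # whole history sent as input this turn
    history += tool_result         # tool output appended for the next turn

print(total_input_billed)  # far more than turns * system_prompt
```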
reply
That's valid, but it's also worth knowing it's only one part of the puzzle. The submission title doesn't say "input".
reply
Yes. I actually noticed my token usage go down on 4.6 when I started switching every session to max effort. I got work done faster with fewer steps because thinking corrected itself before it cycled.

I’ve noticed 4.7 cycling a lot more on basic tasks. Though, it also seems a bit better at holding long running context.

reply
With AIs, it seems like there's never a comparison that's actually useful.
reply
You can build evals. Look at Harbor or Inspect. It’s just more work than most are interested in doing right now.
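For a flavor of what "building evals" means, a minimal hand-rolled harness. This is not Harbor's or Inspect's actual API, just the shape of the idea; `fake_model` is a stand-in callable, and real frameworks add datasets, solvers, and graders on top of this loop:

```python
def run_eval(model, cases):
    """Score a model callable against (prompt, expected) pairs; return pass rate."""
    passed = sum(1 for prompt, expected in cases if model(prompt) == expected)
    return passed / len(cases)

# Usage with a stand-in "model" (a plain lookup function here).
cases = [("2+2", "4"), ("capital of France", "Paris"), ("3*3", "9")]
fake_model = lambda p: {"2+2": "4", "capital of France": "Paris"}.get(p, "?")
print(run_eval(fake_model, cases))  # 2 of 3 cases pass
```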
reply
Yup, it's all vibes. And Anthropic is still winning on those in my book.
reply