upvote
Of course, I’m not trying to dismiss gains from harness, actually the opposite.

But the narrative that 4.Y is an improvement over 4.X is essential to keep the model training music playing.

If 90+% of the gains come from the harness, how can you continue to justify spending billions of dollars on training and an 80% gross margin on inference on the latest model? (Reportedly what Anthropic commands on the top tier of their frontier model API billing).

So differentiating between the two (what I’m trying to do here) is really consequential!

reply
Except LLMs are simulacra of actual intelligence. Frequently in a single conversation working on a single narrowly scoped task, I am both surprised by a few insights and cursing at how it can miss obvious issues. The "raw intelligence" of LLMs leaves much to be desired.
reply