If you're benchmarking something, old & well-characterized often beats new & uncharacterized.

Sure, there may be shortcomings, but they're well understood. The closer you get to the cutting edge, the less characterization data you get to rely on. You need to be able to trust & understand your measurement tool for the results to be meaningful.

Why not? If they’ve shown that 4o is calibrated to human responses, and they haven’t shown that yet for 5.4…