by greenavocado 18 hours ago

by Retr0id 17 hours ago

Well, that's certainly some web design.

by DetroitThrow 18 hours ago

The methodology leaves a lot to be desired when it comes to understanding the tasks you've used. It's important to detail why they're more meaningful tests than the long-horizon and coding tests used by other rankings.

False positives and poorly defined tasks/acceptance criteria have given some models insanely inflated scores on bad benchmarks.

And sure, you can say they're not disclosed to prevent gaming, but if you're the only one who can review them, then they might as well be a random number generator display with an unreadable UI.

by greenavocado 18 hours ago

You're not wrong, but the scores track with my experience switching between the proposed top variants. So there's my unscientific "evidence."