Tbf, most of the "real benchmarks" have issues that are just as bad. Assessing LLM performance is just hard.
And personal too. Different engineers are using them for different use cases.
The important point is that your benchmark is pretty much irrelevant to actual usage. Thus, whatever conclusion you draw is not just irrelevant but misleading.
Thanks, I didn't mean to be brusque, but I have seen a lot of these vibe tests lately that come to grand conclusions like "X model is better than Y" from the result of a single prompt.

Appreciate you sharing the results of your tests though!

I appreciate the feedback!