Is there really no rule that discourages 99% of your interactions with HN from being the peddling of some useless slop benchmark?
reply
If it's relevant to the discussion, I hope not.

I've spent probably over 100 hours working on this benchmarking site/platform, and all the tests are manually written. For me (and many others who reached out to me) it is not useless either. I use it myself regularly when choosing and comparing new models. I honestly believe it is providing value to the conversation.

Let me know if you know of a better platform for comparing models; I built this one because I didn't find any with good enough UX.

reply
It's a great benchmark. Don't listen to the haters. This one is especially interesting.

https://aibenchy.com/compare/anthropic-claude-sonnet-4-6-med...

reply
This one's even more interesting

https://aibenchy.com/compare/anthropic-claude-opus-4-6-mediu...

Who knew Anthropic was this far behind???

reply
Yeah, but actually that's not a good look. Anyone who's used Gemini will know how erratic it is at getting anything serious done, compared to the rock-solid Opus experience.
reply
Their benchmark is chock-full of things like that: it's deeply flawed and essentially rates how LLMs perform when you go out of your way to hold them entirely the wrong way.
reply