Is there really no rule that discourages 99% of your interactions with HN from being the peddling of some useless slop benchmark?
reply
If it's relevant to the discussion, I hope not.

I've spent probably over 100 hours working on this benchmarking site/platform, and all the tests are manually written. For me (and many others who reached out to me) it is not useless either. I use it myself regularly when choosing and comparing new models. I honestly believe it is providing value to the conversation.

Let me know if you know of a better platform for comparing models; I built this one because I didn't find any with good enough UX.

reply
It's a great benchmark. Don't listen to the haters. This one is especially interesting.

https://aibenchy.com/compare/anthropic-claude-sonnet-4-6-med...

reply
This one's even more interesting

https://aibenchy.com/compare/anthropic-claude-opus-4-6-mediu...

Who knew Anthropic was this far behind???

reply
Yeah, but actually that's not a good look. Anyone who's used Gemini will know how erratic it is at getting anything serious done, compared to the rock-solid Opus experience.
reply
Their benchmark is chock-full of things like that: it's deeply flawed and essentially rates how LLMs perform when you go out of your way to hold them entirely the wrong way.
reply