[-]

The only benchmark that matters is your actual task.

I've had models that benched poorly but performed great. And I constantly see models near the top of AA that are terrible.

There doesn't seem to be much overlap between benchmarks and real-world usage. (Let alone common sense!)

As far as they go, though, these harder benchmarks match my experience more closely:

That's where we see "top" models drop way down in score when given longer tasks.

That being said, I've had a reasonably pleasant time with GLM-5.2 so far. (And have had an OK time with DeepSeek as well.)

By the time I'm done testing all the Chinese models, they'll be obsolete :)

[-]

According to reports in this thread, it is somewhere between Opus 4.7 and 4.8. That is effectively frontier.