>but it didn't perform well in our coding and reasoning testing

>Comprehensive evaluation results at https://gertlabs.com/rankings

But if you go to the linked site, it seems like the only thing being evaluated is how well the models play various games? I suppose that counts as "reasoning", but I don't see how coding ability is tested.

reply
"Games" is loosely defined here, as we run the bench across hundreds of unique environments. For some, the models write code to play a game, either one-shot or via a harness where they can iterate and use tools. Others they play directly, making a decision on each game tick. Some are real-time, giving the models a harness where they can write code handlers or submit decisions to interact with the environment directly.
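To make the tick-based setup concrete, here is a rough sketch of what a per-tick decision harness could look like. Every name here (the toy environment, `model_decide`, `run_episode`) is invented for illustration; the actual bench's interfaces aren't public.

```python
# Hypothetical sketch of a per-tick decision harness. All names are
# invented stand-ins, not the benchmark's real interfaces.

class GridGame:
    """Toy environment: move a cursor toward a goal on a 1-D track."""
    def __init__(self, start=0, goal=5):
        self.pos = start
        self.goal = goal
        self.done = False

    def observe(self):
        # Observation handed to the model each tick.
        return {"pos": self.pos, "goal": self.goal}

    def step(self, action):
        # Apply the model's decision for this tick.
        if action == "right":
            self.pos += 1
        elif action == "left":
            self.pos -= 1
        self.done = self.pos == self.goal

def model_decide(obs):
    # Stand-in for a model call: return a decision for this tick.
    return "right" if obs["pos"] < obs["goal"] else "left"

def run_episode(game, policy, max_ticks=100):
    # Harness loop: observe, decide, step, until done or out of ticks.
    ticks = 0
    while not game.done and ticks < max_ticks:
        game.step(policy(game.observe()))
        ticks += 1
    return ticks

print(run_episode(GridGame(), model_decide))  # 5 ticks to reach the goal
```

In the one-shot variant, the model would instead emit the whole `policy` function as code up front, and the harness would just run it.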

Coding is what we test most heavily. Testing it via a game format (instead of correct/incorrect answers) lets us score code objectively, scale to smarter models, and directly compare performance across models. When we built the first iteration last year, I was surprised by how well it mapped to subjective experience of using models for coding. Games really are great for measuring intelligence.

reply
GLM-5.1 does not support image input.
reply
This may be a strange request, but is it at all possible to include Cursor's Composer models in your tests?
reply
I am curious about the model, but for the most part, we have access to the same models that you do and only test models with standalone API releases.
reply
I think the point is to use them both, with GLM-5.1 delegating vision tasks to GLM-5V-Turbo.
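That kind of delegation could be sketched roughly like this. The `call_*` functions are hypothetical stand-ins, not either model's actual API; the point is just the routing: image inputs go to the vision model first, and its output is folded into the text model's prompt.

```python
# Hypothetical sketch of vision delegation between two models.
# Both call_* functions are invented stand-ins for real API clients.

def call_glm_51(prompt):
    # Stand-in for the text-only model.
    return f"text-answer({prompt})"

def call_glm_5v_turbo(prompt, image):
    # Stand-in for the vision model.
    return f"caption({image})"

def answer(prompt, image=None):
    if image is not None:
        # Delegate the vision task, then hand the result to the
        # text model as part of its prompt.
        caption = call_glm_5v_turbo("describe this image", image)
        prompt = f"{prompt}\n[image description: {caption}]"
    return call_glm_51(prompt)
```

Whether that actually works well in a benchmark harness is another question, since the text model only ever sees the vision model's description, not the image itself.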
reply