Thanks for answering this random internet guy's question. It's a bit sad that a German math professor doesn't have sufficient funds to run a few prompts. I would have paid for them for this amount of advertising. I don't like that you gave them to a Silicon Valley company.

On that note, the tests are very US-centric. There is only one Chinese model, and you unfairly nerfed it by limiting its context window, when the compressed context is DeepSeek V4's main innovation, and even with full context it is much cheaper to run than all the others.

Please indicate which other models you would like to see included. (And I agree that the context window limitations were not reasonable.) Finally: running these few prompts would have cost $10-20k if I had run them myself via the API. (And the company didn't ask to contribute; I asked whether they would be willing to do so, just saying.)

Kimi K2.6 and MiMo 2.5 Pro are ahead of DeepSeek V4 in other benchmarks. Anyhow, great work: the benchmark seems to show great separation, so it should be very useful for improving the math capabilities of the next generation of AI. I'm more interested in the prompt engineering/orchestration and technical details (what I can do without millions), but I get that you are mathematicians, so your focus is obviously on the math. Sorry for the nagging.