DeepSeek v4 Pro struggles with a custom harness, while none of the models ranked above it do, so it gets downweighted in the agentic coding benchmarks (although it ranks better than Flash in one-shot problem solving: https://gertlabs.com/rankings?ow=1&mode=oneshot_coding). We ran plenty of samples.

MiMo v2.5 is on there, as well as the Pro version.

We found a few anomalies in our evaluations, which makes sense -- if every new sub-release were better across the board in every area of the model card, that should raise alarms about benchmaxxing. But the main thing we found is that hype != performance, and I trust our benchmark methodology significantly more than the model cards the labs attach to their press releases.
