undefined

points

[-]

SWE-REbench should not be gameable. They collect new issues from live repos, and if you check 1-2 months after a model was released, you can get an idea. But even that would be "benchmaxxxable", which is an overloaded term that can mean many things, but the most vanilla interpretation is that with RL you can get a model to follow a certain task pretty well, but it'll get "stuck" on that task type, or "stubborn" when asked similar but sufficiently different tasks. So for swe-rebench that would be "it fixes bugs in these types of repos, under this harness, but ask it to do soemthing else in a repo and you might not get the same results". In a nutshell.

by underlines23 hours ago|

prev|

[-]

well, your own, unleaked ones, representing your real workloads.

if you can't afford to do that, look at a lot of them, eg. on artificialanalysis.com they merge multiple benchmarks across weighted categories and build an Intelligence Score, Coding Score and Agentic score.

by WarmWash23 hours ago|

prev|

[-]

ARC-AGI 2

GLM 5 scores 5% on the semi-private set, compared to SOTA models which hover around 80%.

by cbg021 hours ago|

prev|

[-]

None. Try them out with your own typical tasks to see the performance.