You can read the paper here: https://labs.scale.com/papers/swe_bench_pro

TL;DR: it's very effective because it directly tests models on REAL codebases: "The benchmark is constructed from GPL-style copyleft repositories and private proprietary codebases". The use case is very real.

It doesn't sound to me like this benchmark is attempting to measure architecture design. As far as I can see in the paper, they do not evaluate the architectural quality of a task completion, only whether the model is capable of completing the task at all.