You can’t trust that a model scoring 93% is better at software engineering than one scoring 90%, because at that level it’s impossible to distinguish recall from reasoning.
40% vs 90%? Sure.
70% vs 90%? _Absolutely meaningless_: at that point you are not measuring coding intelligence but “how well can the model exploit flaws in SWE-bench Verified”. The 70% model could genuinely be the better coder, even assuming no deliberate benchmaxxing or foul play.
But how do you know whether a model was over-optimized for the benchmark or is just really good?
It would be interesting to see a deeper investigation into how the models actually handle these tasks, and whether the high-scoring ones appear to have been trained on the benchmark itself.
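One cheap place such an investigation could start is an n-gram overlap probe between benchmark items and any corpus you suspect is in the training data. Here’s a minimal sketch; the function names, the default n=13 (in the range used by some published contamination analyses), and the toy data are all my own illustration, not anyone’s actual methodology:

```python
# Minimal sketch of an n-gram contamination probe. Illustrative only:
# a real study would run this over the actual SWE-bench Verified problem
# statements / gold patches and a candidate pretraining corpus.

from typing import Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Return the set of word-level n-grams in `text` (lowercased)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_fraction(benchmark_item: str, corpus: str, n: int = 13) -> float:
    """Fraction of the benchmark item's n-grams that also occur in the corpus.

    A high fraction suggests the item (issue text, gold patch, tests) was
    seen near-verbatim during training -- recall, not reasoning.
    """
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0
    return len(item_grams & ngrams(corpus, n)) / len(item_grams)


if __name__ == "__main__":
    # Toy data (hypothetical); n=5 so the overlap is visible at this size.
    issue = ("TypeError raised when calling Model.save with a deferred "
             "field because the update path assumes all fields are loaded")
    crawl = ("forum dump: TypeError raised when calling Model.save with "
             "a deferred field because of an unrelated migration issue")
    print(f"5-gram overlap: {overlap_fraction(issue, crawl, n=5):.0%}")
```

Since vendors rarely publish their corpora, the natural follow-up is probing the model directly, e.g. prompting it with the first half of an issue description and checking whether its continuation reproduces the real second half verbatim.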