They're saying they need to move on from it because the benchmark is flawed (without offering proof), and that's why they can't hit 100%.
It's not an "our models are so good that the benchmark is too easy" thing.
> We also found evidence that models that have seen the problems during training are more likely to succeed, because they have additional information needed to pass the underspecified tests.
> This means that improvements on SWE-bench Verified no longer reflect meaningful improvements in models’ real-world software development abilities. Instead, they increasingly reflect how much the model was exposed to the benchmark at training time.
Did we read the same article?