They're saying they need to move on from it because the benchmark is flawed (without offering proof), and that's why they can't hit 100%.
It's not an "our models are so good that the benchmark is too easy" thing.
> We also found evidence that models that have seen the problems during training are more likely to succeed, because they have additional information needed to pass the underspecified tests.
> This means that improvements on SWE-bench Verified no longer reflect meaningful improvements in models’ real-world software development abilities. Instead, they increasingly reflect how much the model was exposed to the benchmark at training time.
Did we read the same article?