I feel like they're quite open about why they think the benchmark doesn't work anymore:

> We also found evidence that models that have seen the problems during training are more likely to succeed, because they have additional information needed to pass the underspecified tests.

> This means that improvements on SWE-bench Verified no longer reflect meaningful improvements in models’ real-world software development abilities. Instead, they increasingly reflect how much the model was exposed to the benchmark at training time.

by f33d5173 22 hours ago

> without bringing in proof

Did we read the same article?

by MattRix 22 hours ago

How can you say “without bringing in proof” when there is literally proof in the article?