Hacker News
by primaprashant 7 hours ago
by stared 6 hours ago
SWE-bench Verified is, at this point, contaminated
https://openai.com/index/why-we-no-longer-evaluate-swe-bench...
So it is hard to tell how much of a model's gain is due to skill and how much to overfitting.