upvote
From my limited testing 4.6 is able to do more profound analysis on codebases and catches bugs and oddities better.

I had two different PRs with some odd edge case (thankfully catched by tests), 4.5 kept running in circles, kept creating test files and running `node -e` or `python 3` scripts all over and couldn't progress.

4.6 thought and thought in both cases around 10 minutes and found a 2 line fix for a very complex and hard to catch regression in the data flow without having to test, just thinking.

reply
Isn't SWE-Bench Verified pretty saturated by now?
reply
Depends what you mean by saturated. It's still possible to score substantially higher, but there is a steep difficulty jump that makes climbing above 80%ish pretty hard (for now). If you look under the hood, it's also a surprisingly poor eval in some respects - it only tests Python (a ton of Django) and it can suffer from pretty bad contamination problems because most models, especially the big ones, remember these repos from their training. This is why OpenAI switched to reporting SWE-Bench Pro instead of SWE-bench Verified.
reply