upvote
There's an asterisk right below that table stating that:

> *Anthropic reported signs of memorization on a subset of problems

And from the Anthropic's Opus 4.7 release page, it also states:

> SWE-bench Verified, Pro, and Multilingual: Our memorization screens flag a subset of problems in these SWE-bench evals. Excluding any problems that show signs of memorization, Opus 4.7’s margin of improvement over Opus 4.6 holds.

reply
Was 4.7 distilled off Mythos (which got 77.8%)? Interesting how mythos got 82% on terminal-bench 2.0 compared to 82.7% for GPT-5.5.

Also notice how they state just for SWE-Bench Pro: "*Anthropic reported signs of memorization on a subset of problems"

reply