I'm highly skeptical of DeepSWE. It rates GPT-5.4-mini as three times better than deepseek-v4-pro, but every time I use GPT-5.4-mini I find that it completely sucks at following directions.
reply
Yeah, I share the same sentiment. I have yet to find a task where gpt-5.4-mini isn't bordering on unusable.
reply
I don't know if DeepSWE is genuinely a good benchmark. What matters more is that their analysis objectively demolished the validity of SWE-Bench Pro: it is being mismarked.

I think that buys enough credibility to propose an alternative.

I think there's a case to answer if Anthropic models underperform on a novel benchmark. I'd like to see more novel benchmarks to get a clearer picture.

reply
It's the extra-high thinking setting. On artificialanalysis.ai it uses 240M tokens vs. 40M for GPT-5.4/5, which isn't worth it even at the low price.
reply
This is a terrible benchmark. It literally tests the models on their ability to track shifting line numbers. If they cannot keep up, no amount of abstract reasoning can redeem them.
reply
Where did you get that idea? It uses mini-swe-agent, same as SWE-Bench.

https://github.com/datacurve-ai/deep-swe

reply