
I'm highly skeptical of DeepSWE. It rates GPT-5.4-mini as three times better than deepseek-v4-pro, but every time I use GPT-5.4-mini I find that it completely sucks at following directions.

by gck118 hours ago

Yeah, I share the same sentiment. I have yet to find a task where gpt-5.4-mini isn't bordering on unusable.

by lordmauve 19 hours ago

I don't know if DeepSWE is genuinely a good benchmark. What matters more is that their analysis objectively demolished the validity of SWE-Bench Pro: it is being mismarked.

I think that buys enough credibility to propose an alternative.

I think there's a case to answer if Anthropic models underperform on a novel benchmark. I'd like to see more novel benchmarks to get a clearer picture.

by sourcecodeplz 20 hours ago

It's the extra-high thinking setting: on artificialanalysis.ai it uses 240M tokens vs. 40M for GPT-5.4/5, so it's not worth it even at the low price.

by mordae 16 hours ago

This is a terrible benchmark. It literally tests the models on their ability to track shifting line numbers. If they cannot keep up, no amount of abstract reasoning can redeem them.

by lordmauve 8 hours ago

Where did you get that idea? It uses mini-swe-agent, same as SWE-Bench.

https://github.com/datacurve-ai/deep-swe

by mordae 8 hours ago

[flagged]