by phainopepla2 21 hours ago

by gck1 18 hours ago

Yeah, I share the same sentiment. I have yet to find a task where gpt-5.4-mini isn't bordering on unusable.

by lordmauve 19 hours ago

I don't know if DeepSWE is genuinely a good benchmark. What matters more is that their analysis objectively demolished the validity of SWE-Bench Pro: it is being mismarked.

I think that buys enough credibility to propose an alternative.

I think there's a case to answer if Anthropic models underperform on a novel benchmark. I'd like to see more novel benchmarks to get a clearer picture.

by sourcecodeplz 20 hours ago

It is the extra-high thinking: on artificialanalysis.ai it uses 240M tokens vs. 40M for GPT5.4/5. Not worth it, even at the low price.