by zzleeper 5 days ago

by bfeynman 5 days ago

Given it was made by Cognition (the team behind the Devin flop), who now just have to wait until Claude and GPT-5 basically do all of the work for them: not very. When you read about it, the framework is highly subjective, which very quickly becomes a problem because it's based on heuristics that will probably change a lot with a better code model.

by vanuatu 5 days ago

the subjective framework is exactly why it's good

prior benchmarks relied mostly on unit tests or synthetic judges, which are easily benchmaxxed, and that leads to nobody trusting benchmarks

we need people manually checking the data for good code quality

by vanuatu 5 days ago

i worked on one of the benchmarks typically found in new model releases

the methodology of this benchmark looks very good. a Cognition researcher checking the data themselves is very high signal (not scalable, so don't take the benchmark as gospel, but directionally good)

by Catloafdev 5 days ago

It's a relatively new benchmark but from what I can tell it has serious cred behind it. I assume it will be picked up as part of the standard suite of CS-related benchmarks soon enough.

by emp17344 5 days ago

Seems like it literally popped up yesterday with the express purpose of building hype for this release.

by osti 5 days ago

And there's a notable absence of the DeepSWE benchmark, where they do badly, yet somehow a benchmark that was published yesterday is in this announcement.

by zzleeper 5 days ago

Exactly... a bit of a red flag for me.

by swyx 5 days ago

team member here - we had been working on frontiercode for ~6-7 months. the timing just lined up

by emp17344 5 days ago

Yeah, right. If this benchmark was truly developed independently, and the timing just “lined up”, how did Anthropic even know to include results in their model release documentation the day after the benchmark was revealed? It seems like there must have been some collaboration or influence from Anthropic behind the scenes.

by oblio 5 days ago

Come on, why are you a jerk about this?

Nobody would have 800+ billion reasons to lie by commission or omission here.

by vanuatu 5 days ago

i doubt it, Cognition wants coding agents to be better because it directly improves their product

they aren't married to a particular lab, most of their usage is their in-house model i believe

by anthonypasq 5 days ago

what incentive does Cognition have for doing this? seems like complete nonsense speculation on your part.

by bel8 5 days ago

With billions/trillions of dollars floating around, is it hard to imagine benchmarks could be biased?

I think it's safe to assume everything AI related is heavily biased until proven otherwise. Just like in pharma.

by camdenreslink 5 days ago

People game benchmarks for fake internet points to get their favorite web framework to the top of the list. I'm pretty sure they will do it for billions of dollars.

by anthonypasq 5 days ago

you didn't answer my question. Why would Cognition be biased towards making Anthropic look good?

by gloosx 5 days ago

Because Cognition is a major customer of Anthropic?

by anthonypasq 4 days ago

they are also a major customer of OpenAI and every other model maker. what's your point?

by schipperai 5 days ago

Cognition did a good job of documenting their approach [1].

TL;DR - they worked with OSS project maintainers to build tasks. They score models based on whether a PR is mergeable. All tasks are graded by a human researcher. SoTA models still have hill-climbing to do, which raises the bar and inspires confidence. I'd say it's legit.

[1]: https://x.com/cognition/status/2064061031912288715

by shimman 5 days ago

It's an unacademic benchmark by a failed VC startup clawing for relevancy.

by CSMastermind 5 days ago

DeepSWE is the benchmark you actually want to look out for. It's the only one that aligns with actual user-reported results from trying the models.

by ryeguy 5 days ago

Did you read the blog post? They compare to DeepSWE and call it out as the worst one for false positives (the solution failed, but the benchmark assessed it as correct). It also has less language variance.

by CSMastermind 5 days ago

I mean, yes, that is what you'd say if you were writing a blog post about your new benchmark.

by ryeguy 4 days ago

Sure, but they at least quantified it with data. It's not like they just dropped a sentence saying the above; they showed numbers.
