by mdasen12 hours ago |

by leerob2 hours ago|

(I work at Cursor) When Composer 2.5 launched, we initially scored very competitively on AA's composite benchmark. I believe 3rd place overall. They have recently updated to use DeepSWE, which has more of a focus on very long-horizon tasks, and Composer isn't as good at those yet. We're aware and working on this for our next model.

Overall, some benchmarks show Composer doing well, others not so much. We think the model is very capable at the given price point. There's lots to improve! If you see any specific behaviors or places where the model isn't very good, let me know here or email me at lrobinson at cursor.com.

by BugsJustFindMe53 minutes ago|

> We think the model is very capable at the given price point.

The "price point" comparison is a lie though because Composer is only available with a monthly Cursor subscription, and Cursor's external-model-per-token charges for other models are not representative of what other models' monthly subscribers get. An OpenAI $200 subscription gets you at least as much GPT 5.5 as a $200 Cursor subscription gets you Composer 2.5.

by giancarlostoro2 hours ago|

How does it compare to a $100 Claude subscription at $60? Especially in terms of how much of it I can use, because I haven't found anything in the US that gets me usage similar to Claude at $100 per month or less. I'm really open to alternatives.

Grok build only gave me roughly 10 hours of use for $40 for the entire month...

I don't even care about long horizon, can I use it a reasonable amount of time through the month? I use AI for hobby projects, Claude gets me quite far, but I tire of dropping $100 every month. I'm not sending my money to some Chinese firm that now has access to my computer.

by artooro2 hours ago|

I never run long horizon tasks. So Composer 2.5 is great.

by forgot-my-pw1 hours ago|

Even with the new benchmark, Composer 2.5 seems to be just a bit worse than Opus 4.7. So I assume it's going to be roughly on par with Sonnet 5.0 at 1/6 of the cost.

by ai_slop_hater1 hours ago|

Don't lie. You forked a Chinese model.

by CuriouslyC3 hours ago|

Not hard to understand what's going on here. They RL'd around patterns in their data and specific capabilities, so of course they'd construct a benchmark that's aligned with the training set.

Ironically, their benchmark might be more accurate than Artificial Analysis for a narrow slice of things that Cursor's Eigencustomer is really interested in. Otherwise I'd take it as just another data point.

by leerob2 hours ago|

(I work at Cursor) CursorBench includes many evals from actual engineering tasks from the Cursor team, which include our private codebase. This codebase is held-out from training so models haven't seen it, including Composer.

by jmcqk624 minutes ago|

I can't speak to benchmarks, but I have used Composer 2.5 extensively and it's performed quite well in my real world tasks.

by subhobroto3 minutes ago|

> Cursor's benchmark finds that Cursor's model (Composer 2.5) is basically as good as Opus 4.8 max and GPT-5.5 xhigh, but at a fraction of the price.

Your skepticism is well-founded IMHO. I have found that if you are one-shotting a Django/Next CRUD app, a React/Vue UI, shell scripts or GitHub Actions, Composer 2.5 is fantastic!

But for anything outside the median of the last decade's web development - like free-body physics, kinematics, or optimization - Composer is *horribly* unpredictable.

It isn't universally trash; rather, it confidently makes subtle, incorrect assumptions. It inserts tiny footguns that require you to scrutinize *every* single token it generates.

Opus 4.8 max, on the other hand, refuses to guess, at least the way I have set it up. If there's *any* ambiguity about the implementation or how tests should be written, it *stops* and asks me for clarification. I actually trust the output without worrying about hidden disasters and ticking timebombs.

Yes, Opus is far more expensive, but it's worth it for the time saved on review, which is our current blocker.

The real friction is that Cursor's marketing is so aggressive that the people paying the bills look at my Opus usage and demand to know why I'm not using the cheaper alternative!

It’s an impossible argument to win when the rest of the company's devs are happily building standard web apps on Composer without issue, blissfully unaware of how the model falls apart on harder engineering problems.

Fable 5 is in a league of its own. If history predicts the future, in ~6 months we should have open-weight models that are competitive with Fable 5. Without considering what it will take to run such a thing, I would be extremely excited to have open access to such a capability. Great times ahead!

by burmanm10 hours ago|

DeepSWE is slightly flawed in the sense that it uses only its own harness, and that causes issues for models that are not correctly supported by it. There's a huge amount of evidence that the harness plays a big role in how these models work, and yet DeepSWE entirely removes that (and has probably only tested that it works fine with some favourite model of theirs).

There are also issues with cost calculation (as that harness doesn't use caches) and so on, as reported in their GitHub issues.

None of the benchmarks are perfect, but that does explain a lot of the variations between benchmarks.
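
To illustrate the caching point, here is a minimal Python sketch of how ignoring prompt caching can inflate the reported cost of an agent run. Everything in it is an assumption made up for illustration: the `Pricing` fields, the `run_cost` helper, the prices, and the token counts are hypothetical and are not taken from DeepSWE or any vendor's price list.

```python
from dataclasses import dataclass


@dataclass
class Pricing:
    """Hypothetical USD prices per million tokens (not real vendor prices)."""
    input_per_m: float         # uncached input tokens
    cached_input_per_m: float  # input tokens served from the prompt cache
    output_per_m: float        # generated tokens


def run_cost(pricing: Pricing, input_tokens: int, cached_tokens: int,
             output_tokens: int, honor_cache: bool = True) -> float:
    """Cost of one agent run in USD.

    cached_tokens is the portion of input_tokens that a caching harness
    would serve from the prompt cache. With honor_cache=False, every
    input token is billed at the full uncached rate.
    """
    if not honor_cache:
        cached_tokens = 0
    fresh_tokens = input_tokens - cached_tokens
    total = (fresh_tokens * pricing.input_per_m
             + cached_tokens * pricing.cached_input_per_m
             + output_tokens * pricing.output_per_m)
    return total / 1_000_000


# Assumed numbers: a long agent loop that re-sends mostly the same context.
pricing = Pricing(input_per_m=3.0, cached_input_per_m=0.3, output_per_m=15.0)
with_cache = run_cost(pricing, 2_000_000, 1_800_000, 50_000)
without_cache = run_cost(pricing, 2_000_000, 1_800_000, 50_000, honor_cache=False)

print(f"with cache:    ${with_cache:.2f}")
print(f"without cache: ${without_cache:.2f}")
```

Under these made-up numbers the uncached estimate is several times the cached one, which is the kind of gap that could skew a cost comparison between models if the harness and a vendor's own tooling treat caching differently.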

by extr3 hours ago|

I think DeepSWE is flawed in a different way: the tasks look like someone took a bunch of big, highly technical PRs they found really well done, and inverted them into specs for agents to autistically execute. This is not really how people use agents in practice IMO. And it's why DeepSWE is so generous to OAI models: rigid task execution is the thing they're best at. I think FrontierCode matches the vibes a lot better.

by famouswaffles12 hours ago|

Cursor sessions are pretty much what Composer models are RL'd on. This bench and the training data are/should be basically the same distribution.

by justachillguy8 hours ago|

Naturally, given it’s their benchmark they have overfitted their model somewhat to it.

by muzani10 hours ago|

Anecdotally, I find Composer 2.5 to be useless. I do use light LLMs like Claude Haiku and some of Cursor's older free models, but Composer is negative productivity for me.

by maxdo9 hours ago|

The opposite: I use it for everything, like triggering and monitoring a 10-step release process using Composer. A very capable model.

by vorticalbox8 hours ago|

this is my finding too, i have moved to it fully for most of the plan/coding.

for most tasks it is capable and very cheap; a day's worth of tasks costs about $10

by urbsgpw6 hours ago|

Same here. Maybe I'm underusing it a bit, because for anything that is a bit more complex I tend to err on the safe side and go with Anthropic, but I wonder if that's just a placebo effect because I pay more for it.

I do feel that they've really upped their game with composer this year though.

by datadrivenangel12 hours ago|

For lighter interactive agentic coding, where you type stuff into an IDE and a minute or three later get results back for review, Composer 2.5 is honestly pretty great. The results get notably worse for larger tasks though.

by anon70008 hours ago|

Agreed. It’s worse than Opus of course. But Opus takes more than 10x longer to give you something to look at. I’m not kidding, I “benchmarked” a real ticket I was working on. Opus 4.7 took more than 30min. Opus 4.8 took over an hour. Composer 2.5 took 5min on the exact same prompt & local setup. My subjective review is that composer’s code was only like 10-20% worse. It still worked, it was just a bit less clean and a little more hacky. But it’s not like Opus is flawless either. At the end of the day, if it takes an hour to get to draft code I can look at and iterate on… that’s fucking impossible for me. Unless it did an excellent job. But as long as I still need to review and follow up with changes, Opus is just too slow. It’s really frustrating because it’s a lot slower than it was 6mo ago, and not noticeably better. Fable seems a step in the right direction but is $$$$

by WinstonSmith849 hours ago|

That benchmark seems to match my experience. GPT 5.5 is significantly better than Opus 4.8, the last time I tried Composer 2.5 it was truly dumb, and Fable to me looks to be on par with GPT 5.5 but... different overall. The best approach is to have an LLM peer review between GPT and Opus (now Fable) for the best outcome.

by ciaf11 hours ago|

By the same token, Fable 5 is given a score of 77 vs 76 for GPT 5.5

by apothegm8 hours ago|

Composer writes the worst, stupidest, most naive and straight up brain-dead code you could imagine. Fast and cheap is about all it’s got going for it. I mostly use it for “sort these lines alphabetically” and stuff that’s a smidge too complex for regex find/replace.

by bengale13 minutes ago|

It’s starting to feel like people need to say what language/stack and problem space they’re working in. It would be interesting to see why we’re seeing such wild variance.

by simondotau7 hours ago|

I primarily use Composer. I wanted to build something from scratch recently and, thinking I was missing out on something, I got Opus to build it. I wasn't blown away. I gave the same prompts to Composer and the code it came up with was different but similar in quality. I ended up going with the Composer code because its faster response time made it easier to iterate on improvements.

by whazor11 hours ago|

I mean, they train their model on their training data. So it should score well on their own benchmark.
