Overall, some benchmarks show Composer doing well, others not so much. We think the model is very capable at the given price point. There's lots to improve! If you see any specific behaviors or places where the model isn't very good, lmk here or email me at lrobinson at cursor.com.
The "price point" comparison is a lie though because Composer is only available with a monthly Cursor subscription, and Cursor's external-model-per-token charges for other models are not representative of what other models' monthly subscribers get. An OpenAI $200 subscription gets you at least as much GPT 5.5 as a $200 Cursor subscription gets you Composer 2.5.
Grok build only gave me roughly 10 hours of use for $40 for the entire month...
I don't even care about long horizon, can I use it a reasonable amount of time through the month? I use AI for hobby projects, Claude gets me quite far, but I tire of dropping $100 every month. I'm not sending my money to some Chinese firm that now has access to my computer.
Ironically, their benchmark might be more accurate than Artificial Analysis for a narrow slice of things that Cursor's Eigencustomer is really interested in. Otherwise I'd take it as just another data point.
Your skepticism is well-founded IMHO. I have found that if you are one-shotting a Django/Next CRUD app, a React/Vue UI, shell scripts or GitHub Actions, Composer 2.5 is fantastic!
But for anything outside the median of the last decade's web development - like free-body physics, kinematics, or optimization - Composer is *horribly* unpredictable.
It isn't universally trash; rather, it confidently makes subtle, incorrect assumptions. It inserts tiny footguns that require you to scrutinize *every* single token it generates.
Opus 4.8 max, on the other hand, refuses to guess, at least the way I have set it up. If there's *any* ambiguity about the implementation or how tests should be written, it *stops* and asks me for clarification. I actually trust the output without worrying about hidden disasters and ticking timebombs.
Yes, Opus is far more expensive, but it's worth it for the time saved on review, which is our current blocker.
The real friction is that Cursor's marketing is so aggressive that the people paying the bills look at my Opus usage and demand to know why I'm not using the cheaper alternative!
It’s an impossible argument to win when the rest of the company's devs are happily building standard web apps on Composer without issue, blissfully unaware of how the model falls apart on harder engineering problems.
Fable 5 is in a league of its own. If history predicts the future, in ~6 months we should have open-weight models that are competitive with Fable 5. Without considering what it will take to run such a thing, I would be extremely excited to have open access to such a capability. Great times ahead!
There are also issues with cost calculation (as that harness doesn't use caches) and so on, as reported in their GitHub issues.
None of the benchmarks are perfect, but that does explain a lot of the variations between benchmarks.
For most tasks it is capable and very cheap; a day's worth of tasks costs me about $10.
I do feel that they've really upped their game with Composer this year though.