by yanis_t8 hours ago |

by DCKing6 hours ago|

The moat right now is model performance and what that means for how many tokens and additional time you spend.

I say this as a relatively frequent user of Kimi models and generally a big fan. But on not-yet-gamed benchmarks like DeepSWE, Kimi K2.6 is beaten soundly by Claude Sonnet 4.6 ($3 / $15) and even slightly by GPT 5.4 Mini ($0.75 / $4.50).

There's no question Kimi models are very good for a lot of code tasks. They're the best-quality open-weight models. But to get overall outcomes similar to Sonnet/Opus, on average you'll spend many more tokens and will have to do more managing of the model. You shouldn't look at price per token; you should look at how much you pay for the entire process.
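
To make the "entire process" framing concrete, here is a rough sketch of the arithmetic. The $3 / $15 figures are the Sonnet 4.6 prices quoted above, read as USD per million input / output tokens (an assumption). The cheaper model's prices, all token counts, and the attempt counts are made-up placeholders for illustration, not measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """USD per million tokens (assumed interpretation of the quoted prices)."""
    input_per_mtok: float
    output_per_mtok: float


def task_cost(pricing: Pricing, input_tokens: int, output_tokens: int,
              attempts: int = 1) -> float:
    """Total USD for one finished task, counting every attempt it took."""
    per_attempt = (
        input_tokens * pricing.input_per_mtok
        + output_tokens * pricing.output_per_mtok
    ) / 1_000_000
    return per_attempt * attempts


# Prices quoted in the comment above for Sonnet 4.6.
sonnet = Pricing(input_per_mtok=3.0, output_per_mtok=15.0)
# Hypothetical cheaper model: these prices are placeholders, not real quotes.
cheaper = Pricing(input_per_mtok=1.0, output_per_mtok=4.0)

# Hypothetical usage: the cheaper model burns more tokens and needs a retry.
sonnet_total = task_cost(sonnet, input_tokens=200_000, output_tokens=20_000)
cheaper_total = task_cost(cheaper, input_tokens=400_000, output_tokens=50_000,
                          attempts=2)

print(f"sonnet:  ${sonnet_total:.2f} per task")
print(f"cheaper: ${cheaper_total:.2f} per task")
```

With these placeholder numbers the model that is cheaper per token ends up costing more per finished task. Plug in your own measured token counts and retry rates, since the outcome depends entirely on them.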

by esperent6 hours ago|

I'm more interested in how much effort I have to put in, at least while I'm paying in the range of current subscriptions (so ~€100-€200 a month or so). If the prices go up much more than that I'll have to switch to caring more about token efficiency. But at current pricing the bottleneck is my attention, not model efficiency. As such, even a small improvement in model quality - and hence, a decrease in how much attention I have to spend on it - makes a big difference.

by Bnjoroge4 hours ago|

I personally don't put any weight on DeepSWE. Other than 5.5 being directionally the best model, it gets the others pretty wrong in my experience. FrontierCode from Cognition looks interesting.

by papersail6 hours ago|

I'm not sure I would put too much weight on DeepSWE as a benchmark, given that GPT-5.4-mini ended up close to Opus 4.6 there.

by DCKing6 hours ago|

Any benchmark is iffy and has weird results, but this is the best we got at the moment. Most people working with Opus and Kimi would likely tell you they're much further apart than the numbers that were quoted for Kimi K2.6, and DeepSWE seems to capture that gap better.

One major thing DeepSWE has going for it that all other benchmarks (including those quoted by MoonshotAI on this page) don't: it hasn't been gamed. The other benchmarks' answers are public and part of each model's training data. This benchmark may still be iffy, but at least it's not gamed.

by WarmWash5 hours ago|

Somehow the internet has also forgotten that cheating to get ahead in China is basically a norm and expected behavior.

by DCKing4 hours ago|

American labs also use gamed and cherry-picked benchmarks extensively. Anthropic used them in their Fable announcement and avoided DeepSWE because Fable doesn't beat GPT-5.5 on that one. Google's numbers for Gemini 3.5 Flash recently did not line up at all with people's subjective experience using the model, and this also happened with Gemini 3.1 Pro before it.

Everybody has incentives to manipulate benchmark results to show their models in the best light.

by bensyverson55 minutes ago|

Part of Anthropic's moat is Claude Cowork & Claude Code. They got coders comfortable with CC and enterprise users comfortable with Cowork, and both are creating stickiness.

The reality is that $20/$100/$200/mo feels reasonable to a lot of people relative to the value they're getting out of Claude, and if they switch to something else, there's a risk that it won't be as good, and they'll have a new tool to learn.

It's not an insurmountable moat, but don't underestimate the user experience. The iPod didn't win because it was the cheapest device or the one with the most features.

by LUmBULtERA6 hours ago|

API token price is one thing, but subscriptions on Claude are a good value. Weirdly everyone says that Claude subscriptions are subsidized because of the API price, even though (1) no one actually knows Claude's cost of inference, and (2) Chinese providers are also able to provide cheap inference, so why do they think Claude can't?

I also wonder if Enterprises have deals for other API pricing that is not posted publicly, so all we see is a high API sticker price.

by wuliwong46 minutes ago|

I only have knowledge of one enterprise deal, but there is no discount, which I found surprising.

by michaelcampbell56 minutes ago|

> while being only marginally better.

It's only marginally better in the things it's actually comparable to. Anthropic's models are MUCH better in many more things, e.g. things Kimi/etc. didn't distill.

For those things the difference is like a cliff.

by tornikeo49 minutes ago|

That's a baseless claim that borderline reads like shilling. Do you have any proof of what you wrote there?

by efromvt6 hours ago|

I think the perception is that it is not 'only marginally better'; whether or not you specifically agree, that perceived quality gap lets them differentiate on price.

I'd further say that there are probably enough rational actors running evals out there that the 'marginally better' is not pure vibes in the cases where people are spending lots of money, but I only have direct line of sight to some of those eval suites. Maybe everyone is irrational and Anthropic is exploiting that!

by selfawareMammal48 minutes ago|

Performance. I pay for Opencode but none of the models give me Codex performance, so I have to keep my €20 subscription plus the Opencode one.

by khuey6 hours ago|

I think most people who've tried them both would tell you Anthropic's models are more than marginally better than Kimi. Kimi and the other open source models may score well on SWE-bench or whatever but the gap is noticeable IMHO once you actually try to use them.

by Bnjoroge4 hours ago|

It depends on what your task is and how precise your prompts are. Planning with fable or 4.8, laying out the plan as a step-by-step process, coding with mimo v2.5 pro, dsv4pro, or qwen 3.7 max, and doing a final review with 5.5 has worked really well for me for infra stuff.

by smoe6 hours ago|

I reckon right now the Enterprise concern is more FOMO around the AI wave and how to retrain or replace up to hundreds of thousands of employees. I don't think cost is the main concern right now.

But if AI doesn't quickly lead to the large-scale replacement of workers that was promised, I could definitely see the C-suites and their gaggle of consultants starting to ask questions about token pricing.

by gruez4 hours ago|

Your question relies on the premise that Chinese companies continue releasing free models. What's "the moat" for them continuing to do that?

by yababa_y7 hours ago|

I want Opus to be only marginally better, but I do mostly research engineering and its ability to not fuck up my projects is absent. Every time my credits lapse I let kimi and composer2.5 have some play, and it’s basically just an excuse for me to keep playing computer, because when the oai/ant credits refresh I always need to spend hours recovering from the other models’ misconceptions or boneheaded eng practices. Even when I only let it touch my web games…

by nullbio6 hours ago|

I think none of them having a de facto, high-quality, English-focused CLI is a big part of it. None of the Chinese models I've tried have worked well in open-source CLIs. Granted, I've only tried a few, but still...

by freigeist796 hours ago|

i use github copilot cli + openrouter + qwen 3.7 max and it's really much better than i expected (used to opus 4.7 at work)

by Bnjoroge4 hours ago|

huh? They all work great in omp/opencode unless you mean their own native clis like kimi code

by re-thc7 hours ago|

> My theory is that US enterprise just can't send data to Chinese

Lots of US providers are hosting these “open source” models, so I doubt that’s the problem.
