Opus isn't that expensive to host. Look at Amazon Bedrock's t/s numbers for Opus 4.5 vs the Chinese models. They're within the same order of magnitude, which means that Opus has roughly the same number of active params as the Chinese models.
Also, you can select BF16 or Q8 providers on OpenRouter.
They do have different infrastructure / electricity costs and they might not run on Nvidia hardware.
It's not just the models.
Namely, Amazon Bedrock and Google Vertex.
That means normalized infrastructure costs, normalized electricity costs, and normalized hardware performance. Normalized inference software stack, even (most likely). It's about as close to a 1-to-1 comparison as you can get.
Both Amazon and Google serve Opus at roughly half the speed of the Chinese models. Note that they are not incentivized to slow down the serving of Opus or the Chinese models! So that tells you the ratio of active params for Opus and for the Chinese models.
We were responding to a claim of about 10x, not 0.5x.
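To spell out the arithmetic behind that inference, here is a minimal sketch. It assumes that on the same provider (same hardware and software stack) decode speed scales inversely with active parameters, which is a simplification that ignores batching, quantisation and per-model hardware allocation. The t/s figures are hypothetical placeholders, not measurements from Bedrock or Vertex.

```python
def implied_active_param_ratio(tps_a: float, tps_b: float) -> float:
    """Estimate active_params(A) / active_params(B) from tokens/s.

    Assumption (not a measured fact): tokens/s is inversely proportional
    to active parameters when both models run on the same stack.
    """
    if tps_a <= 0 or tps_b <= 0:
        raise ValueError("tokens/s must be positive")
    return tps_b / tps_a


# Hypothetical throughput numbers, for illustration only.
opus_tps = 40.0
other_tps = 80.0

ratio = implied_active_param_ratio(opus_tps, other_tps)
print(f"Implied active-param ratio (Opus / other): {ratio:.1f}x")
```

Under that assumption, half the speed suggests about 2x the active params, whereas a 10x gap in active params would show up as roughly a 10x gap in t/s.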
x86 vs arm64 could have different performance. The Chinese models could be optimized for different hardware so it could show massive differences.
Also, with Nvidia you get the efficiency of everything (including inference) being built on/for CUDA; even efforts to catch AMD up are still ongoing afaik.
I wouldn't be surprised if things like DS were trained and now hosted on Nvidia hardware.
They are. Nvidia makes A LOT of profit. Hey, top stock for a reason.
> I wouldn't be surprised if things like DS were trained and now hosted on Nvidia hardware
DS is "old". I wouldn't study them. The new ones have a mandate to at least run on local hardware. There are data center requirements.
I agree it could still be trained on Nvidia GPUs (black market etc), but not running.
They do? Source?
But if that's true, it would explain why Minimax, Z.ai and Moonshot are all organized as Singaporean holding companies, with claimed data center locations (according to OpenRouter) in the US or Singapore and only the devs in China. Can't be forced to use inferior local hardware if you're just a body shop for a "foreign" AI company. ;)
They just have a China only endpoint and likely a company under a different name.
Nothing to do with AI. TikTok is similar (global vs China operations).
If Opus were 10x larger than the Chinese models, then Google Vertex/Amazon Bedrock would serve it 10x slower than Deepseek/Kimi/etc.
That's not the case. They're in the same order of magnitude of speed.
It could still be 10x larger overall, though that would not make it 10x more expensive.
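A rough sketch of why "10x larger overall" and "10x more expensive to serve" are different things for a mixture-of-experts model: total parameters drive how much memory the weights occupy, while active parameters drive per-token compute. The `MoEModel` class and every number below are hypothetical, chosen only to illustrate the distinction; nothing here is a known figure for Opus or any other model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MoEModel:
    name: str
    total_params_b: float   # billions of parameters across all experts
    active_params_b: float  # billions of parameters used per token

    def weight_memory_gb(self, bytes_per_param: float = 1.0) -> float:
        # 1e9 params * bytes per param = GB (1.0 is roughly Q8, 2.0 is BF16).
        return self.total_params_b * bytes_per_param

    def relative_decode_cost(self, baseline: "MoEModel") -> float:
        # Assumption: per-token decode cost scales with active params only.
        return self.active_params_b / baseline.active_params_b


# Hypothetical models, for illustration only.
small = MoEModel("small-moe", total_params_b=1000.0, active_params_b=32.0)
big = MoEModel("big-moe", total_params_b=10000.0, active_params_b=64.0)

memory_ratio = big.weight_memory_gb() / small.weight_memory_gb()
cost_ratio = big.relative_decode_cost(small)
print(f"{memory_ratio:.0f}x the weights, {cost_ratio:.0f}x the per-token cost")
```

In this made-up case the bigger model holds 10x the weights but costs only about 2x per token, which is the shape of the argument above.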
According to OpenRouter, AWS serves the latest Opus and Sonnet at roughly the same speed. It's likely that they simply allocate hardware differently per model.
The quantisation is shown on the provider section.
I find it a good comparison because it gives us a baseline, and we have zero insider knowledge of Anthropic. It gives me an idea of what cost is associated with a model of a certain size.
I don't buy the 10x efficiency thing: they are just lagging behind the performance of current SOTA models. They perform much worse than the current models while also costing much less - exactly what I would expect. Current Qwen models perform as well as Sonnet 3, I think. Two years from now, when Chinese models catch up with enough distillation attacks, they will be as good as Sonnet 4.6 and still be profitable.
Define "much worse".
+--------------------------------------+-------------+-----------+------------------+
| Benchmark                            | Claude Opus | DeepSeek  | DeepSeek vs Opus |
+--------------------------------------+-------------+-----------+------------------+
| SWE-Bench Verified (coding)          | 80.9%       | 73.1%     | ~90%             |
| MMLU (knowledge)                     | ~91         | ~88.5     | ~97%             |
| GPQA (hard science reasoning)        | ~79–80      | ~75–76    | ~95%             |
| MATH-500 (math reasoning)            | ~78         | ~90       | ~115%            |
+--------------------------------------+-------------+-----------+------------------+
Lots of models get really close on benchmarks, but benchmarks only tell us how good they are at solving a defined problem. Opus is far better at solving ill-defined ones.
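For anyone who wants to check the "DeepSeek vs Opus" column in the table above, a small sketch that recomputes it from the quoted scores. Where the table gives a range (e.g. ~79–80), the midpoint is used; that is my assumption, not something stated in the thread.

```python
# (Opus score, DeepSeek score) as quoted in the table above.
# Ranges are replaced by their midpoints (an assumption for illustration).
scores = {
    "SWE-Bench Verified": (80.9, 73.1),
    "MMLU": (91.0, 88.5),
    "GPQA": (79.5, 75.5),
    "MATH-500": (78.0, 90.0),
}


def relative_score(opus: float, deepseek: float) -> int:
    """DeepSeek score as a rounded percentage of the Opus score."""
    return round(100 * deepseek / opus)


relative = {name: relative_score(*pair) for name, pair in scores.items()}
for name, pct in relative.items():
    print(f"{name}: ~{pct}%")
```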
Ah, the "trust me bro" advantage. Couldn't it just be brand identity and familiarity?
My dashboard goes from all green to 50/50 green/red for our agents whenever I switch from Claude to one of the cheaper agents... This is after investing a substantial amount of effort in "dumbing down" the prompts - e.g. adding a lot of extra wording to convince the dumber models to actually follow instructions - that is not necessary for Sonnet or Opus.
I buy the benchmarks. The problem is that a 10% difference in the benchmarks makes the difference between barely usable and something that can consistently deliver working code unilaterally and require few review interventions. Basically, the starting point for "usable" on these benchmarks is already very far up the scale for a lot of tasks.
I do strongly believe the moat is narrow. With 4.6 I switched from defaulting to Opus to defaulting to Sonnet for most tasks. I can fully see myself moving substantial workloads to a future iteration of Kimi, Qwen or Deepseek in 6-12 months once they actually start approaching Sonnet 4.5 level. But for my use at least, currently, they're at best competing with Anthropic's 3.x models in terms of real-world ability.
That said, even now, I think if we were stuck with current models for 12 months, we might well also be able to build our way around this and get to a point where Deepseek and Kimi would be cheaper than Sonnet.
Eventually we'll converge on good enough harnesses to get away with cheaper models for most uses, and the remaining appeal for the frontier models will be complex planning and actual hard work.
These are not cell phone plans that the average Joe takes; they are plans purchased with the explicit goal of software development.
I would guess that 99 out of every 100 plans are purchased with the explicit goal of maxing them out.
When I have a feeling that these tools will speed me up, I use them.
My client pays for a couple of these tools in an enterprise deal, and I suspect most of us on the team work like that.
If my goal was to max out every tool my client pays for, I’d be working 24 hours a day and never see sunlight.
I guess it’s like the all you can eat buffet. Everybody eats a lot, but if you eat so much that you throw up and get sick, you are special.
Why? Because in my experience, the bottleneck is in shareholders approving new features, not my ability to dish out code.
If I hit the limit, usually I'm not using it well and am just hunting around. If I'm using it right, I'm basically gassed out before I get anywhere near the limit.