That's a tautology. People think Chinese models are 10x more efficient because they're 10x cheaper, and then that same price gap gets used as evidence that they're 10x more efficient.

Opus isn't that expensive to host. Look at Amazon Bedrock's tokens/sec numbers for Opus 4.5 vs the Chinese models. They're within the same order of magnitude, which suggests Opus has roughly the same number of active params as the Chinese models.

Also, you can select BF16 or Q8 providers on openrouter.
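The practical difference between those two options is mostly memory footprint. A rough sketch of the arithmetic, with a made-up 100B parameter count purely for illustration:

```python
# Rough memory arithmetic behind the BF16 vs Q8 choice:
# BF16 stores 2 bytes per weight, Q8 roughly 1 byte per weight.
# The 100B parameter count is hypothetical, not any specific model.

def weight_memory_gb(n_params, bytes_per_weight):
    return n_params * bytes_per_weight / 1e9

n = 100e9  # hypothetical 100B-parameter model
print(weight_memory_gb(n, 2))  # BF16 -> 200.0 GB
print(weight_memory_gb(n, 1))  # Q8   -> 100.0 GB
```

Halving the bytes per weight halves what the provider has to keep in VRAM, which is why Q8 endpoints tend to be cheaper.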

reply
Opus doubled in speed with version 4.5, leading me to speculate that they promoted a Sonnet-sized model. The new, faster Opus was the same speed as Gemini 3 Flash running on the same TPUs. I think Anthropic's margins are probably the highest in the industry, but they have to split them with Google by renting their TPUs.
reply
This is not a valid argument. TPS is essentially a QoS knob and can be adjusted; allocating more GPUs results in higher speed.
reply
There are sequential dependencies, so you can't just arbitrarily increase speed by parallelizing over more GPUs: every token depends on all previous tokens, and every layer depends on all previous layers. You can arbitrarily slow a model down by using fewer or slower GPUs (or none at all), though.
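The data dependency is easy to see in a toy decode loop. This is not any real model, just a stand-in `next_token` function, but the structure is the point: step t+1 can't begin until step t has produced its token, no matter how much hardware you throw at it.

```python
# Toy sketch of autoregressive decoding. The "model" here is a fake
# deterministic function; only the loop structure matters.

def next_token(context):
    # Stand-in for a full forward pass through every layer.
    return sum(context) % 100 if context else 1

def generate(prompt_tokens, n_new):
    tokens = list(prompt_tokens)
    for _ in range(n_new):
        # Each step consumes ALL previous tokens. More GPUs can make
        # each step faster, but cannot run the steps concurrently.
        tokens.append(next_token(tokens))
    return tokens

print(generate([3, 5], 4))  # -> [3, 5, 8, 16, 32, 64]
```

More GPUs shrink the latency of each forward pass (tensor parallelism), but the loop itself stays serial.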
reply
> That's a tautology. People think Chinese models are 10x more efficient because they're 10x cheaper

They do have different infrastructure/electricity costs, and they might not run on Nvidia hardware.

It's not just the models.

reply
Except there are providers that serve both the Chinese models AND Opus, on the same hardware.

Namely, Amazon Bedrock and Google Vertex.

That means normalized infrastructure costs, normalized electricity costs, and normalized hardware performance. Most likely a normalized inference software stack, even. It's about as close to a 1-to-1 comparison as you can get.

Both Amazon and Google serve Opus at roughly ~1/2 the speed of the Chinese models. Note that they are not incentivized to slow down the serving of either Opus or the Chinese models! So that tells you the ratio of active params between Opus and the Chinese models.
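The back-of-envelope version of that argument, with made-up throughput numbers purely for illustration (these are not real Bedrock/Vertex figures):

```python
# Hypothetical numbers only. On an identical serving stack, decode
# throughput scales roughly inversely with active parameter count,
# so the observed speed ratio bounds the active-param ratio.

opus_tps = 40.0      # hypothetical tokens/sec for Opus
chinese_tps = 80.0   # hypothetical tokens/sec for a Chinese MoE model

# active_params ~ 1 / tokens_per_sec when everything else is normalized
implied_active_param_ratio = chinese_tps / opus_tps
print(implied_active_param_ratio)  # -> 2.0, i.e. ~2x, nowhere near 10x
```

If Opus really had 10x the active params, you'd expect something close to a 10x throughput gap on the same hardware, not ~2x.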

reply
Deployments like Bedrock are nowhere near SOTA operational efficiency, 1-2 OOM behind. The hardware is much closer, but pipelining, scheduling, caching, recomputation, routing, and similar optimizations blow naive end-to-end architectures out of the water.
reply
And Microsoft's Azure. It's on all 3 major cloud providers. Which tells me they can make a profit from these cloud providers without having to pay for any hardware. They just take a small enough cut.

https://code.claude.com/docs/en/microsoft-foundry

https://www.anthropic.com/news/claude-in-microsoft-foundry

reply
AWS and GCP both have their own custom inference chips, so a better example for hosting Opus on commodity hardware would be Digital Ocean.
reply
> Both Amazon and Google serve Opus at roughly ~1/2 the speed of the Chinese models

The claim being responded to was 10x, not 0.5x.

x86 vs arm64 can have very different performance. The Chinese models could be optimized for different hardware, which could show up as massive differences.

reply
I mean, GN has covered the Nvidia black market in China enough that we pretty much know they still run on Nvidia hardware.
reply
How is this related to inference, may I ask? Apart from some very hardware-specific optimizations of model architecture, there's nothing preventing you from hosting these models on your own infrastructure. And that's what many OpenRouter providers, at least some of which are based in the US, are actually doing. Most of the Chinese models mentioned here are open-weight (except for Qwen, which has one proprietary "Max" model), so literally anyone can host them, not just someone in China. It just doesn't really matter.
reply
I mean, sure, but in terms of inference per dollar / per watt, Nvidia's GPUs are pretty far up there - unless China is pumping out domestic chips cheaply enough.

Also, with Nvidia you get the efficiency of everything (including inference) being built on/for CUDA; efforts to catch AMD up are still ongoing, AFAIK.

I wouldn't be surprised if things like DS were trained, and are now hosted, on Nvidia hardware.

reply
> unless China is pumping out domestic chips cheaply enough

They are. Nvidia makes A LOT of profit. Hey, top stock for a reason.

> I wouldn't be surprised if things like DS were trained and now hosted on Nvidia hardware

DS is "old"; I wouldn't study them. The new ones have a mandate to at least run on local hardware. There are data center requirements.

I agree it could still be trained on Nvidia GPUs (black market, etc.), but not run on them.

reply
> The new ones have a mandate to at least run on local hardware.

They do? Source?

But if that's true, it would explain why Minimax, Z.ai and Moonshot are all organized as Singaporean holding companies, with claimed data center locations (according to OpenRouter) in the US or Singapore and only the devs in China. Can't be forced to use inferior local hardware if you're just a body shop for a "foreign" AI company. ;)

reply
> with claimed data center locations (according to OpenRouter) in the US or Singapore and only the devs in China

They just have a China-only endpoint and likely a company under a different name.

Nothing to do with AI. TikTok is similar (global vs China operations).

reply
Comparing open-source models like Qwen against Anthropic's models is absolutely foolish. First, Anthropic has never disclosed the actual parameter count or architecture of its models. Second, it's well known that these open-source models more or less distill from other models and use MoE, which lets them run at much lower computational cost. Using Qwen as a comparison point only shows how foolish the blog post's author is. That the article devotes such a large portion to discussing Qwen on OpenRouter, I find hard to believe.
reply
Anthropic is obviously also aware of the benefits of MoE and distilling a larger model into a smaller one, so they could run a model of the same size as Alibaba's for the same inference cost if they want to. Or they can run a slightly larger model for slightly higher cost. They definitely aren't running a much larger model (except potentially as a teacher for distillation training) because then they wouldn't be able to hit the output speeds they're hitting.
reply
Agreed, but I'd guess Opus 4.6 is 10x larger rather than the Chinese models being 10x more efficient. GPT-4 was reportedly already a 1.6T-parameter model, and Llama 4 Behemoth is also much bigger than the Chinese open-weight models. Chinese tech companies are short on frontier GPUs, but they've done a lot of innovation on inference efficiency (DeepSeek CEO Liang himself shows up in the author lists of the related published papers).
reply
No, Opus cannot be 10x larger than the Chinese models.

If Opus were 10x larger than the Chinese models, then Google Vertex/Amazon Bedrock would serve it 10x slower than DeepSeek/Kimi/etc.

That's not the case. They're in the same order of magnitude of speed.

reply
They serve it about 2x slower. So it must have about 2x the active parameters.

It could still be 10x larger overall, though that would not make it 10x more expensive.
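That distinction between total and active parameters is the whole MoE point. A sketch with illustrative numbers (the 40B/400B/80B figures are invented, not claims about any actual model):

```python
# Hypothetical numbers. In an MoE model, only the routed "active"
# parameters run per token, so per-token compute (and thus speed and
# marginal serving cost) tracks active params, not total params.

def decode_flops_per_token(active_params):
    # Common rule of thumb: ~2 FLOPs per active parameter per token.
    return 2 * active_params

dense_total = 40e9                   # hypothetical 40B dense model
moe_total, moe_active = 400e9, 80e9  # hypothetical 400B MoE, 80B active

print(moe_total / dense_total)  # -> 10.0  (10x the total weights)
print(decode_flops_per_token(moe_active)
      / decode_flops_per_token(dense_total))  # -> 2.0 (only 2x the compute)
```

So a 2x speed gap is consistent with a model that is far more than 2x larger in total weights; total size mostly costs memory, not per-token compute.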

reply
I agree that Opus almost definitely isn't anywhere near that big, but AWS throughput might not be a great way to measure model size.

According to OpenRouter, AWS serves the latest Opus and Sonnet at roughly the same speed. It's likely that they simply allocate hardware differently per model.

reply
GPT-4 was likely much larger than any of the SOTA models we have today, at least in terms of active parameters. Sparse models are the new standard, and the price drop that came with Opus 4.5 made it fairly obvious that Anthropic are not an exception.
reply
Actually, Opus might achieve a lower cost with the help of TPUs.
reply
> Plus who knows what OpenRouter providers do in terms of quantization

The quantisation is shown in the provider section.

reply
>It is not. It's a terrible comparison. Qwen, deepseek and other Chinese models are known for their 10x or even better efficiency compared to Anthropic's.

I find it a good comparison because it's a useful baseline, given we have zero insider knowledge of Anthropic. It gives me an idea of what a model of a certain size costs to run.

I don't buy the 10x efficiency thing: they are just lagging behind the performance of current SOTA models. They perform much worse than the current models while also costing much less - exactly what I would expect. Current Qwen models perform about as well as Sonnet 3, I think. Two years from now, when the Chinese models catch up with enough distillation attacks, they'll be as good as Sonnet 4.6 and still profitable.

reply
> I don't buy the 10x efficiency thing: they are just lagging behind the performance of current SOTA models. They perform much worse than the current models while also costing much less - exactly what I would expect.

Define "much worse".

  +--------------------------------------+-------------+-----------+------------------+
  | Benchmark                            | Claude Opus | DeepSeek  | DeepSeek vs Opus |
  +--------------------------------------+-------------+-----------+------------------+
  | SWE-Bench Verified (coding)          | 80.9%       | 73.1%     | ~90%             |
  | MMLU (knowledge)                     | ~91         | ~88.5     | ~97%             |
  | GPQA (hard science reasoning)        | ~79–80      | ~75–76    | ~95%             |
  | MATH-500 (math reasoning)            | ~78         | ~90       | ~115%            |
  +--------------------------------------+-------------+-----------+------------------+
reply
Everyone who's used Opus knows it's better than the others in a way that isn't captured by the benchmarks. I would describe it as taste.

Lots of models get really close on benchmarks, but benchmarks only tell us how good they are at solving a defined problem. Opus is far better at solving ill-defined ones.

reply
>Everyone who's used Opus knows it's better than the others in a way that isn't captured by the benchmarks. I would describe it as taste.

Ah, the "trust me bro" advantage. Couldn't it just be brand identity and familiarity?

reply
I have a project where we've had Opus, Sonnet, DeepSeek, Kimi, and Qwen create and execute an aggregate total of about 350 plans so far. The quality difference, measured in plans where the agent failed to complete the task on the first run, is high enough that the cheaper models end up costing several times Anthropic's subscription prices, though they'll probably come out cheaper than the API prices once we've improved the harness further. At present the challenge is that the amount of human intervention the cheaper models require drives up their cost.

My dashboard goes from all green to 50/50 green/red for our agents whenever I switch from Claude to one of the cheaper agents... and this is after investing a substantial amount of effort in "dumbing down" the prompts - e.g. adding a lot of extra wording to convince the dumber models to actually follow instructions - which is not necessary for Sonnet or Opus.

I buy the benchmarks. The problem is that a 10% difference in the benchmarks makes the difference between barely usable and something that can consistently deliver working code unilaterally and require few review interventions. Basically, the starting point for "usable" on these benchmarks is already very far up the scale for a lot of tasks.

I do strongly believe the moat is narrow - with 4.6 I switched from defaulting to Opus to defaulting to Sonnet for most tasks. I can fully see myself moving substantial workloads to a future iteration of Kimi, Qwen, or DeepSeek in 6-12 months once they actually start approaching Sonnet 4.5 level. But for my use at least, they're currently at best competing with Anthropic's 3.x models in terms of real-world ability.

That said, even now, I think if we were stuck with the current models for 12 months, we might well be able to build our way around this and get to a point where DeepSeek and Kimi would be cheaper than Sonnet.

Eventually we'll converge on good enough harnesses to get away with cheaper models for most uses, and the remaining appeal for the frontier models will be complex planning and actual hard work.

reply
Where are you getting those benchmark figures from? Math-500 should be closer to 98% for both models: https://artificialanalysis.ai/evaluations/math-500?models=de...
reply
> That being said not all users max out their plan,

These are not cell phone plans that the average Joe buys; they are plans purchased with the explicit goal of software development.

I would guess that 99 out of every 100 plans are purchased with the explicit goal of maxing them out.

reply
I’m not maxing them out… I have issues that I need to fix, features I need to develop, and I have things I want to learn.

When I have a feeling that these tools will speed me up, I use them.

My client pays for a couple of these tools in an enterprise deal, and I suspect most of us on the team work like that.

If my goal was to max out every tool my client pays for, I'd be working 24 hours a day and never see sunlight.

I guess it's like an all-you-can-eat buffet. Everybody eats a lot, but if you eat so much that you throw up and get sick, you're special.

reply
My employer bought me a Claude Max subscription. On heavy weeks I use 80% of the subscription. And among software engineers that I know, I'm a relatively heavy user.

Why? Because in my experience, the bottleneck is in shareholders approving new features, not my ability to dish out code.

reply
goal? yeah. but in reality, just by timing it right (starting a session at 7-8am to get 2 sessions in a workday, or even 3 if you can schedule something at 5am), i rarely hit limits.

if i hit the limit, usually i'm not using it well and hunting around. if i'm using it right, i'm basically gassed out before i can max out the limit.

reply
There’s absolutely no way that’s true.
reply
In SaaS this is not true. Most SaaS is highly profitable (or was, I suppose) because they knew most of their customers would never max out their plans.
reply