I'm sure Anthropic is making money off the API but I highly doubt it's 90% profit margins.
Unlikely. Amazon Bedrock serves Opus at 120tokens/sec.
If you want to estimate "the actual price to serve Opus", a good rough estimate is to find the price max(Deepseek, Qwen, Kimi, GLM) and multiply it by 2-3. That would be a pretty close guess to actual inference cost for Opus.
It's impossible for Opus to be something like 10x the active params as the chinese models. My guess is something around 50-100b active params, 800-1600b total params. I can be off by a factor of ~2, but I know I am not off by a factor of 10.
Comparing tps ratios- by saying a model is roughly 2x faster or slower than another model- can tell you a lot about the active param count.
I won't say it'll tell you everything; I have no clue what optimizations Opus may have, which can range from native FP4 experts to spec decoding with MTP to whatever. But considering chinese models like Deepseek and GLM have MTP layers (no clue if Qwen 3.5 has MTP, I haven't checked since its release), and Kimi is native int4, I'm pretty confident that there is not a 10x difference between Opus and the chinese models. I would say there's roughly a 2x-3x difference between Opus 4.5/4.6 and the chinese models at most.
> Comparing tps ratios- by saying a model is roughly 2x faster or slower than another model- can tell you a lot about the active param count.
You sure about that? I thought you could shard between GPUs along layer boundaries during inference (but not training obviously). You just end up with an increasingly deep pipeline. So time to first token increases but aggregate tps also increases as you add additional hardware.
Hint: what's in the kv cache when you start processing the 2nd token?
And that's called layer parallelism (as opposed to tensor parallelism). It allows you to run larger models (pooling vram across gpus) but does not allow you to run models faster.
Tensor parallelism DOES allow you to run models faster across multiple GPUs, but you're limited to how fast you can synchronize the all-reduce. And in general, models would have the same boost on the same hardware- so the chinese models would have the same perf multiplier as Opus.
Note that providers generally use tensor parallelism as much as they can, for all models. That usually means 8x or so.
In reality, tps ends up being a pretty good proxy for active param size when comparing different models at the same inference provider.
The Trillions of parameters claim is about the pretraining.
It’s most efficient in pre training to train the biggest models possible. You get sample efficiency increase for each parameter increase.
However those models end up very sparse and incredibly distillable.
And it’s way too expensive and slow to serve models that size so they are distilled down a lot.
Since then inference pricing for new models has come down a lot, despite increasing pressure to be profitable. Opus 4.6 costs 1/3rd what Opus 4.0 (and 3.5) costs, and GPT 5.4 1/4th what o1 costs. You could take that as indication that inference costs have also come done by at least that degree.
My guess would have been that current frontier models like Opus are in the realm of 1T params with 32B active
42 tps for Claude Opus 4.6 https://openrouter.ai/anthropic/claude-opus-4.6
143 tps for GLM 4.7 (32B active parameters) https://openrouter.ai/z-ai/glm-4.7
70 tps for Llama 3.3 70B (dense model) https://openrouter.ai/meta-llama/llama-3.3-70b-instruct
For GLM 4.7, that makes 143 * 32B = 4576B parameters per second, and for Llama 3.3, we get 70 * 70B = 4900B, which makes sense since denser models are easier to optimize. As a lower bound, we get 4576B / 42 ≈ 109B active parameters for Opus 4.6. (This makes the assumption that all three models use the same number of bits per parameter and run on the same hardware.)I'd say Opus is roughly 2x to 3x the price of the top Chinese models to serve, in reality.
Of course, intense sparsification via MoE (and other techniques ;) ) lets total model size largely decouple from inference speed and cost (within the limit of world size via NVlink/TPU torrus caps)
So the real mystery, as always, is the actual parameter count of the activated head(s). You can do various speed benchmarks and TPS tracking across likely hardware fleets, and while an exact number is hard to compute, let me tell you, it is not 17B or anywhere in that particular OOM :)
Comparing Opus 4.6 or GPT 5.4 thinking or Gemini 3.1 pro to any sort Chinese model (on cost) is just totally disingenuous when China does NOT have Vera Rubin NVL72 GPUs or Ironwood V7 TPUs in any meaningful capacity, and is forced to target 8gpu Blackwell systems (and worse!) for deployment.
Opus is 2T-3T in size at most.
From my understanding, the "besides training" is a big issue. As I noted earlier[1], Qwen3 was much better than Qwen2.5, but the main difference was just more and better training data. The Qwen3.5-397B-A17B beat their 1T-parameter Qwen3-Max-Base, again a large change was more and better training data.
However, I'd say its relatively well assumed in realpolitik land that Chinese labs managed to acquire plenty of H100/200 clusters and even meaningful numbers of B200 systems semi-illicitly before the regulations and anti-smuggling measures really started to crack down.
This does somewhat beg the question of how nicely the closed source variants, of undisclosed parameter counts, fit within the 1.1tb of H200 or 1.5tb of B200 systems.