It comes from batching and multiple streams on a GPU. More people sharing 1 GPU makes everyone run slower but increases overall token throughput.

Mathematically it comes from the fact that the transformer block is a parallel algorithm. If you batch harder, increase parallelism, you can get higher tokens/s. But you get less throughput. Simultaneously there is also this dial that you can speculatively decode harder with fewer users.

It's true for basically all hardware and most models. You can draw this Pareto curve of throughput per GPU vs. tokens per second per stream. More tokens/s per stream means less total throughput.

See this graph for actual numbers:

[Graph: Token Throughput per GPU vs. Interactivity. gpt-oss 120B, FP4, 1K / 8K. Source: SemiAnalysis InferenceMAX™]

https://inferencemax.semianalysis.com/
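
To make the shape of that Pareto curve concrete, here is a toy model, not a measurement. It assumes each decode step pays a fixed cost to stream the weights plus a small cost per sequence in the batch; both constants (`weight_read_s`, `per_seq_s`) are made-up illustrative values, and real numbers depend on model, hardware, and context length.

```python
def decode_step_seconds(batch_size, weight_read_s=0.020, per_seq_s=0.0005):
    """Toy cost of one decode step.

    weight_read_s: hypothetical fixed time to stream the weights once.
    per_seq_s: hypothetical extra time per sequence (KV-cache reads, attention).
    """
    return weight_read_s + batch_size * per_seq_s


def pareto_points(batch_sizes):
    """Return (batch, tokens/s per stream, total tokens/s per GPU) tuples."""
    points = []
    for b in batch_sizes:
        step = decode_step_seconds(b)
        per_stream = 1.0 / step   # each stream gets one token per step
        total = b / step          # the GPU emits b tokens per step
        points.append((b, per_stream, total))
    return points


for b, per_stream, total in pareto_points([1, 4, 16, 64, 256]):
    print(f"batch={b:4d}  per-stream={per_stream:7.1f} tok/s  total={total:8.1f} tok/s")
```

In this sketch, growing the batch always lowers tokens/s per stream and always raises total tokens/s per GPU, which is the trade-off described above.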

by vlovich123 1 hour ago

> If you batch harder, increase parallelism, you can get higher tokens/s. But you get less throughput. Simultaneously there is also this dial that you can speculatively decode harder with fewer users.

I think you skipped the words “total throughput” there, right? Cause tok/s is a measure of throughput, so it’s clearer to say you increase throughput/user at the expense of throughput/GPU.

I’m not sure about the comment about speculative decode though. I haven’t served a frontier model but generally speculative decode I believe doesn’t help beyond a few tokens, so I’m not sure you can “speculatively decode harder” with fewer users.
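
A rough way to see the diminishing returns is the standard expected-acceptance estimate for speculative decoding. This is a sketch under a strong simplifying assumption: each drafted token is accepted independently with the same probability. The 0.8 acceptance rate is a hypothetical value, not a measured one, and `expected_tokens_per_step` is a name made up for this example.

```python
def expected_tokens_per_step(accept_prob, draft_len):
    """Expected tokens emitted per verification pass of the big model.

    Assumes each of the draft_len drafted tokens is accepted independently
    with probability accept_prob, and that the pass always yields one extra
    token from the big model itself.
    """
    if accept_prob >= 1.0:
        return float(draft_len + 1)
    return (1.0 - accept_prob ** (draft_len + 1)) / (1.0 - accept_prob)


for k in [0, 1, 2, 4, 8, 16]:
    print(f"draft_len={k:2d}  expected tokens/step={expected_tokens_per_step(0.8, k):.2f}")
```

Under these assumptions the expected gain is capped at 1 / (1 - accept_prob) however long the draft is, so most of the benefit arrives within the first few drafted tokens. Longer drafts also spend more compute per step, which spare capacity from having fewer users could in principle absorb.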

by sothatsit 5 hours ago

There are a lot of knobs they could tweak. Newer hardware and traffic prioritisation would both make a lot of sense. But they could also lower batching windows to decrease queueing time at the cost of lower throughput, or keep the KV cache in GPU memory at the expense of reducing the number of users they can serve from each GPU node.

by martinald 10 minutes ago

I think it's just routing to faster hardware:

H100 SXM: 3.35 TB/s HBM3

GB200: 8 TB/s HBM3e

2.4x faster memory - which is exactly what they are saying the speedup is. I suspect they are just routing to GB200 (or TPU etc equivalents).

FWIW I did notice _sometimes_ recently Opus was very fast. I put it down to a bug in Claude Code's token counting, but perhaps it was actually just occasionally getting routed to GB200s.
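
The 2.4x figure is just the ratio of the two bandwidth numbers quoted above. The sketch below works it out and shows why it would carry over to decode speed if decoding were purely memory-bandwidth bound, which is an assumption here rather than something confirmed about their setup. The 100 GB read per token is a hypothetical placeholder.

```python
H100_SXM_TBPS = 3.35  # HBM3 bandwidth quoted in the comment
GB200_TBPS = 8.0      # HBM3e bandwidth quoted in the comment


def bandwidth_ratio(new_tbps, old_tbps):
    """How many times faster the newer memory is."""
    return new_tbps / old_tbps


def max_decode_tokens_per_s(bandwidth_tbps, bytes_read_per_token):
    """Upper bound on single-stream decode speed when every token requires
    streaming bytes_read_per_token from memory (bandwidth-bound assumption)."""
    return bandwidth_tbps * 1e12 / bytes_read_per_token


HYPOTHETICAL_BYTES_PER_TOKEN = 100e9  # made-up value for illustration only

print(f"bandwidth ratio: {bandwidth_ratio(GB200_TBPS, H100_SXM_TBPS):.2f}x")
print(f"H100 bound:  {max_decode_tokens_per_s(H100_SXM_TBPS, HYPOTHETICAL_BYTES_PER_TOKEN):.1f} tok/s")
print(f"GB200 bound: {max_decode_tokens_per_s(GB200_TBPS, HYPOTHETICAL_BYTES_PER_TOKEN):.1f} tok/s")
```

The bytes-per-token value cancels out of the ratio, so under this assumption the speedup depends only on the two bandwidth figures.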

by jstummbillig 5 hours ago

> It seems unlikely it’s just prioritization

Why does this seem unlikely? I have no doubt they are optimizing all the time, including inference speed, but why could this particular lever not entirely be driven by skipping the queue? It's an easy way to generate more money.

by AnotherGoodName 2 hours ago

Yes it's 100% prioritization. Through that it's also likely running on more GPUs at once but that's an artifact of prioritization at the datacenter level. Any task coming into an AI datacenter atm is split into fairly fine-grained chunks of work and added to queues to be processed.

When you add a job with high priority all those chunks will be processed off the queue first by each and every GPU that frees up. It probably leads to more parallelism but... it's the prioritization that led to this happening. It's better to think of this as prioritization of your job leading to the perf improvement.

Here's a good blog for anyone interested which talks about prioritization and job scheduling. It's not quite at the datacenter level but the concepts are the same. Basically everything is thought of as a pipeline. All training jobs are low pri (they take months to complete in any case), customer requests are mid pri and then there are options for high pri. Everything in an AI datacenter is thought of in terms of 'flow'. Are there any bottlenecks? Are the pipelines always full and the expensive hardware always 100% utilized? Are the queue backlogs big enough to ensure full utilization at every stage?

https://www.aleksagordic.com/blog/vllm
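
The chunk-level prioritization described above can be sketched with a plain priority queue. This is a minimal illustration, not how any particular datacenter scheduler or vLLM is implemented: `ChunkQueue`, the job names, and the three priority levels are all hypothetical.

```python
import heapq
import itertools


class ChunkQueue:
    """Toy work queue: lower priority number is served first,
    and chunks with equal priority are served in arrival order."""

    def __init__(self):
        self._heap = []
        self._arrival = itertools.count()  # tie-breaker that keeps FIFO order

    def submit(self, job_id, priority, num_chunks):
        """Split a job into num_chunks units of work and enqueue them."""
        for chunk in range(num_chunks):
            heapq.heappush(self._heap, (priority, next(self._arrival), job_id, chunk))

    def next_chunk(self):
        """Called by whichever GPU frees up; returns (job_id, chunk) or None."""
        if not self._heap:
            return None
        _, _, job_id, chunk = heapq.heappop(self._heap)
        return job_id, chunk


queue = ChunkQueue()
queue.submit("training-run", priority=2, num_chunks=2)   # low pri
queue.submit("standard-req", priority=1, num_chunks=2)   # mid pri
queue.submit("fast-req", priority=0, num_chunks=2)       # high pri

while (item := queue.next_chunk()) is not None:
    print(item)
```

Even though the high-priority job arrives last, every freed-up worker picks its chunks first, so it ends up spread across more GPUs as a side effect of the ordering.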

by kgeist 1 hour ago

>Yes it's 100% prioritization

Amazon Bedrock has a similar feature called "priority tier": you get faster responses at 1.75x the price. And they explicitly say in the docs "priority requests receive preferential treatment in the processing queue, moving ahead of standard requests for faster responses".

by singpolyma3 5 hours ago

Until everyone buys it. Like fast pass at an amusement park where the fast line is still two hours long

by sothatsit 4 hours ago

At 6x the cost, and it requiring you to pay full API pricing, I don’t think this is going to be a concern.

by servercobra 4 hours ago

It's a good way to squeeze extra out of a bunch of people without actually raising prices.

by Nition 5 hours ago

I wonder if they might have mostly implemented this for themselves to use internally, and it is just prioritization but they don't expect too many others to pay the high cost.

by sothatsit 5 hours ago

Roon said as much here [0]:

> codex-5.2 is really amazing but using it from my personal and not work account over the weekend taught me some user empathy lol it’s a bit slow

[0] https://nitter.net/tszzl/status/2016338961040548123

by Nition 1 hour ago

I see Anthropic says so here as well: https://x.com/claudeai/status/2020207322124132504

by re-thc 2 hours ago

Nvidia GB300 i.e. Blackwell.

by pshirshov 5 hours ago

> so what else is changing?

Let me guess. Quantization?