Output tokens are actually kinda expensive for the provider.

Cache-hit input tokens are incredibly cheap for them (incredibly high margin too, except for DeepSeek).

Uncached input tokens are in the middle, since they can be processed very efficiently.

Also his math is wrong. $100k gets you 22.7B output tokens at $4.4/M, which is what GLM 5.2 costs.

At 500 tokens/s, 22.7B tokens is just 526 days, or about 1.44 years. Which is much less than the life of the hardware.
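
Quick sanity check of that arithmetic, as a rough sketch. The inputs are just the figures quoted in this comment ($100k, $4.4 per million output tokens, 500 tokens/s); the function name `payback_time` is made up for illustration, and it assumes the chips run flat out with no downtime.

```python
# Back-of-envelope: how long does it take to serve $100k worth of output tokens?
BUDGET_USD = 100_000          # spend quoted in the comment
PRICE_PER_M_OUTPUT = 4.4      # USD per million output tokens, as quoted
TOKENS_PER_SECOND = 500       # output throughput, as quoted

def payback_time(budget_usd, price_per_million, tokens_per_second):
    """Return (tokens, days, years) needed to sell budget_usd of output tokens."""
    tokens = budget_usd / price_per_million * 1e6
    seconds = tokens / tokens_per_second      # assumes 100% utilization
    days = seconds / 86_400
    return tokens, days, days / 365

tokens, days, years = payback_time(BUDGET_USD, PRICE_PER_M_OUTPUT, TOKENS_PER_SECOND)
print(f"{tokens / 1e9:.1f}B tokens, {days:.0f} days, {years:.2f} years")
# prints: 22.7B tokens, 526 days, 1.44 years
```

With those inputs it works out to roughly 526 days. Real utilization is below 100%, so treat it as a lower bound on the time.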

reply
Inference providers have been getting a firehose of investor cash to keep the chips running (and are looking around very nervously as that firehose starts to sputter).
reply
The inference providers are running batch sizes much larger than 10.
reply