I suspect they quantize them, reduce thinking budgets, batch more requests, or all of the above.
There's also lowering the number of experts you run in MoE models.
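To make the MoE point concrete, here is a toy sketch of top-k routing. Everything in it is hypothetical: the function name `top_k_moe`, the scalar "experts", and the router logits are illustrative only and do not describe any provider's serving stack. It shows that in a standard top-k router the parameter `k` sets how many experts are evaluated per token, so lowering `k` directly reduces compute. Since models are typically trained with a fixed `k`, changing it at serving time would be expected to shift outputs.

```python
import math
from typing import Callable, List, Tuple

# Hypothetical toy expert: maps a scalar input to a scalar output.
Expert = Callable[[float], float]


def top_k_moe(x: float, router_logits: List[float],
              experts: List[Expert], k: int) -> Tuple[float, int]:
    """Route x to the k highest-scoring experts and mix their outputs.

    Returns (mixed_output, number_of_expert_calls).
    """
    if not 1 <= k <= len(experts):
        raise ValueError("k must be between 1 and the number of experts")
    # Indices of the k largest router logits.
    top = sorted(range(len(experts)), key=lambda i: router_logits[i],
                 reverse=True)[:k]
    # Softmax renormalized over the selected experts only
    # (subtracting the max for numerical stability).
    m = max(router_logits[i] for i in top)
    weights = [math.exp(router_logits[i] - m) for i in top]
    total = sum(weights)
    # Only the selected experts are ever called; this is where k saves compute.
    out = sum((w / total) * experts[i](x) for w, i in zip(weights, top))
    return out, len(top)


if __name__ == "__main__":
    experts = [lambda x: x, lambda x: 2 * x, lambda x: 3 * x, lambda x: 4 * x]
    logits = [0.1, 2.0, 1.0, -1.0]
    for k in (1, 2, 4):
        out, calls = top_k_moe(1.0, logits, experts, k)
        print(f"k={k}: output={out:.4f}, expert calls={calls}")
```

With `k=1` only the top-scoring expert runs; raising `k` calls more experts and blends their outputs, at proportionally higher cost.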