I think you skipped the word “total throughout” there right? Cause tok/s is a measure of throughput, so it’s clearer to say you increase throughput/user at the expense of throughput/gpu.
I’m not sure about the comment about speculative decode though. I haven’t served a frontier model but generally speculative decode I believe doesn’t help beyond a few tokens, so I’m not sure you can “speculatively decode harder” with fewer users.