
You're forgetting a critical factor: concurrency. If a given piece of hardware serves a single request at 150 tokens/s, it can also serve 20-30 concurrent requests at 100 tokens/s each. Suddenly your $5K becomes $100K/month, enough to recoup the cost of the hardware in a year or so.

The reason it works: generating the next token means reading the model weights from memory (the memory-bound part), and once they are loaded you can apply them to many requests in the same pass (the compute-bound part). It's also much more energy-efficient per token.
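To put rough numbers on that, here is a toy cost model, not a benchmark. It assumes each decoding step costs one fixed weight read plus a small per-request compute cost. The two constants are hypothetical, picked only so the model reproduces the figures above (150 tokens/s for one request, 100 tokens/s each at 25 concurrent requests); real hardware will not follow this line exactly.

```python
# Toy additive model of batched decoding. Both constants are assumptions
# fitted to the numbers in the comment, not measurements.
WEIGHT_READ_S = 47 / 7200          # hypothetical: time to stream the weights once per step
PER_REQUEST_COMPUTE_S = 1 / 7200   # hypothetical: extra compute time per request per step


def step_time(batch_size, weight_read_s=WEIGHT_READ_S,
              per_request_compute_s=PER_REQUEST_COMPUTE_S):
    """Seconds for one decoding step that emits one token per request."""
    return weight_read_s + batch_size * per_request_compute_s


def throughput(batch_size, weight_read_s=WEIGHT_READ_S,
               per_request_compute_s=PER_REQUEST_COMPUTE_S):
    """Return (tokens/s seen by each request, tokens/s across the whole batch)."""
    per_request = 1.0 / step_time(batch_size, weight_read_s, per_request_compute_s)
    return per_request, batch_size * per_request


if __name__ == "__main__":
    for batch in (1, 10, 25):
        per_request, aggregate = throughput(batch)
        print(f"batch={batch:>2}  per-request={per_request:6.1f} tok/s  "
              f"aggregate={aggregate:7.1f} tok/s")
```

Under these assumptions, going from 1 to 25 concurrent requests slows each request by a third but raises total tokens per second by more than an order of magnitude, because the weight read is paid once per step regardless of batch size.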

[1] https://aimultiple.com/gpu-benchmark

by dakolli 8 hours ago


Interesting, I didn't know about this, but it makes sense after reading the article. They are benchmarking a 20B-parameter model on a single GPU. Does it scale across 60 H100s over NVLink/NVSwitch? I would be interested to see those benchmarks.

Regardless, the idea that everyone is spinning up $2 million in GPUs to scan their email inbox, search the web, or avoid learning something is still ridiculous to me.