There isn't a relationship between parameter size and energy use like that. You could run a 280B parameter model on a Raspberry Pi with a big SSD if you were so determined. The energy use would be small, but you would be waiting a very long time for your response.

Data center energy use isn't simple to calculate because servers are configured to process a lot of requests in parallel. You're not getting an entire GPU cluster to yourself while your request is being processed. Your tokens are being processed in parallel with a lot of other people's requests for efficiency.

This is why some providers can offer a fast mode: Your request gets routed to servers that are tuned to process fewer requests in parallel for a moderate speedup. They charge you more for it because they can't fit as many requests into that server.

by zozbot234 11 hours ago

You're thinking about power use, not energy. There are systems that can more directly minimize energy per operation at the cost of high latency, but they look more like TPUs than Raspberry Pis.
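To make the power vs. energy distinction concrete, here is a minimal sketch. Energy is power multiplied by time, so a low-power device that runs for a long time can still use more energy per response. Every number below is a made-up placeholder for illustration, not a measurement of any real device or model.

```python
def energy_wh(power_watts: float, seconds: float) -> float:
    """Energy in watt-hours for a constant power draw over a duration."""
    return power_watts * seconds / 3600.0


# Hypothetical inputs, chosen only to illustrate the arithmetic.
# A slow, low-power device that needs a full day for one response:
slow_device_wh = energy_wh(power_watts=5.0, seconds=24 * 3600)

# A share of a fast, high-power server that answers in 30 seconds:
fast_share_wh = energy_wh(power_watts=200.0, seconds=30)

print(f"slow device: {slow_device_wh:.1f} Wh")
print(f"fast share:  {fast_share_wh:.2f} Wh")
```

With these placeholder inputs the low-power device draws 40x less power but uses far more energy per response, which is the point being made above.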

by zozbot234 14 hours ago

Energy use for any given request is going to be roughly proportional to active parameters, not total. That would be something like 13B for Flash and 49B for Pro. So you'd theoretically get something like 190W if you could keep the same prefill and decode speed as Flash, which is unlikely.
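The scaling above can be written out as a quick sketch. The 13B and 49B active-parameter figures are the rough estimates from this comment. The 50 W baseline for Flash is an assumption: it isn't stated here, it's just the value that makes the "something like 190W" result come out.

```python
def scaled_power_watts(baseline_watts: float,
                       baseline_active_b: float,
                       target_active_b: float) -> float:
    """Scale a power figure linearly with active parameter count.

    Assumes identical prefill and decode speed for both models,
    which the comment itself calls unlikely.
    """
    return baseline_watts * target_active_b / baseline_active_b


flash_active_b = 13.0        # rough estimate from the comment
pro_active_b = 49.0          # rough estimate from the comment
assumed_flash_watts = 50.0   # ASSUMPTION: not stated in the comment

pro_watts = scaled_power_watts(assumed_flash_watts, flash_active_b, pro_active_b)
print(f"Pro at Flash speed: about {pro_watts:.0f} W")
```

Under that assumed baseline this lands just under 190 W, and it's an upper-bound style estimate since it ignores batching entirely.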

by eurekin 14 hours ago

Batching lowers that, since the model weights are read from memory once per batch rather than once per request. Activation accumulation doesn't scale as nicely.

by wmf 14 hours ago

Power isn't proportional to parameters. It may be vaguely proportional to tokens/s, although batching screws that up.

Claude Sonnet is probably running on an 8-GPU box that consumes 10 kW, while Opus might use more like 50 kW, but that's shared by a bunch of users thanks to batching.
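As a rough sketch of what "shared by a bunch of users" means for the per-request figure: divide the box power by the number of requests being served at once. The 10 kW and 50 kW figures are the guesses from this comment. The concurrency value of 64 is purely hypothetical, since real batch sizes aren't public.

```python
def per_request_watts(box_watts: float, concurrent_requests: int) -> float:
    """Evenly split a server's power draw across concurrent requests."""
    if concurrent_requests < 1:
        raise ValueError("concurrent_requests must be at least 1")
    return box_watts / concurrent_requests


hypothetical_batch = 64  # ASSUMPTION: illustrative only, not a known value

sonnet_share = per_request_watts(10_000.0, hypothetical_batch)  # 10 kW guess
opus_share = per_request_watts(50_000.0, hypothetical_batch)    # 50 kW guess

print(f"10 kW box, {hypothetical_batch} requests: {sonnet_share:.2f} W each")
print(f"50 kW box, {hypothetical_batch} requests: {opus_share:.2f} W each")
```

An even split is a simplification. Requests differ in length and in how much prefill vs. decode work they need, so the real share per user varies.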