Prefill will be a huge deal if batched, unattended inference of SOTA models on consumer platforms becomes viable, because at that point it's the main remaining bottleneck. Suppose running 30 inferences together boosts your aggregate decode throughput to 3x (that's consistent with some very rough experiments, though these haven't even tried to mask SSD offload latency yet). Then the whole batch takes 10x the decode time of a single inference but 30x the prefill time, because prefill workloads are already fully compute bound on consumer platforms and barely benefit from batching.
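
To make the arithmetic concrete, here is a rough back-of-the-envelope sketch. The function names are made up for illustration, and the 60 s prefill / 120 s decode figures for a single request are purely hypothetical placeholders, not measurements. It assumes prefill gets no batching speedup at all, which is a simplification of "doesn't benefit much".

```python
def batched_totals(n_requests, decode_speedup, prefill_speedup=1.0):
    """Total batch time as a multiple of one unbatched request's time.

    decode_speedup: aggregate decode throughput gain from batching (e.g. 3.0).
    prefill_speedup: assumed 1.0, i.e. prefill is already compute bound.
    """
    decode_multiple = n_requests / decode_speedup
    prefill_multiple = n_requests / prefill_speedup
    return decode_multiple, prefill_multiple


def prefill_share(single_prefill_s, single_decode_s, n_requests, decode_speedup):
    """Fraction of total batch wall time spent in prefill."""
    decode_x, prefill_x = batched_totals(n_requests, decode_speedup)
    prefill_total = single_prefill_s * prefill_x
    decode_total = single_decode_s * decode_x
    return prefill_total / (prefill_total + decode_total)


# The numbers from the comment: 30 requests, 3x decode throughput.
decode_x, prefill_x = batched_totals(30, 3.0)
print(decode_x, prefill_x)  # 10.0 30.0

# Hypothetical single-request timings (NOT measured): 60 s prefill, 120 s decode.
print(prefill_share(60.0, 120.0, 1, 1.0))   # share with no batching
print(prefill_share(60.0, 120.0, 30, 3.0))  # share with batching
```

With those placeholder timings, prefill goes from about a third of the wall time for a single request to the majority of it for the batch, which is the sense in which it becomes the main remaining bottleneck.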