undefined

points

[-]

You can improve that with speculative preload. I'm sure models could be designed and tuned around efficient SSD offloading to keep throughput pretty high.

by searealist1 hours ago|

parent|

[-]

It would apply equally to GPU or RAM inference as those are also bandwidth constrained on decode, so people already try to optimize for it.

by rsalus1 hours ago|

parent|

prev|

[-]

surely the supply of unified memory will rise to meet demand before this is needed