You can improve that with speculative preload. I'm sure models could be designed and tuned around efficient SSD offloading to keep throughput pretty high.
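To make the idea concrete, here is a rough sketch of speculative preload for a mixture-of-experts style model. It assumes each expert's weights live in a separate file on the SSD. The names `load_expert`, `run_speculative`, and `guess` are made up for illustration, and the "compute" is just a byte count standing in for real work. A background thread reads the predicted expert while the current step runs; a wrong guess falls back to a blocking read.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def load_expert(path):
    """Read one expert's weights from disk (stand-in for an SSD read)."""
    with open(path, "rb") as f:
        return f.read()


def run_speculative(paths, routes, guess):
    """Process `routes` (the expert actually chosen at each step).

    `guess(prev)` is a hypothetical predictor: given the previous expert id
    (or None at the start), it returns the expert id to preload next.
    Returns (total_bytes_used, prefetch_hits).
    """
    hits = 0
    total = 0
    with ThreadPoolExecutor(max_workers=1) as pool:
        predicted = guess(None)
        pending = pool.submit(load_expert, paths[predicted])
        for actual in routes:
            if actual == predicted:
                weights = pending.result()  # waits only if the read is unfinished
                hits += 1
            else:
                weights = load_expert(paths[actual])  # miss: pay the full read
            # Start the next speculative read before doing this step's work.
            predicted = guess(actual)
            pending = pool.submit(load_expert, paths[predicted])
            total += len(weights)  # placeholder for the real computation
    return total, hits


def demo():
    with tempfile.TemporaryDirectory() as d:
        paths = []
        for i in range(3):
            p = os.path.join(d, f"expert{i}.bin")
            with open(p, "wb") as f:
                f.write(bytes(i + 1))  # expert i is i + 1 bytes long
            paths.append(p)
        # Naive predictor: assume the same expert as last time (0 at the start).
        return run_speculative(paths, [0, 0, 1, 1, 2],
                               lambda prev: 0 if prev is None else prev)
```

How much this helps depends entirely on the hit rate of the predictor, which is where model design and tuning would come in.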
It would apply equally to GPU or RAM inference, since those are also bandwidth-constrained during decode, so people already try to optimize for it.
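A back-of-the-envelope way to see the bandwidth limit: if every generated token has to stream the active weights once, throughput is capped at bandwidth divided by bytes read per token. The function name and the numbers below are placeholders, not measurements of any real device.

```python
def decode_tokens_per_second(bandwidth_bytes_per_s, bytes_read_per_token):
    """Upper bound on decode speed when each token streams the weights once.

    Ignores compute time, caching, and batching, so real throughput differs.
    """
    return bandwidth_bytes_per_s / bytes_read_per_token


# Illustrative values only: 100 GB/s of bandwidth, 10 GB of weights per token.
estimate = decode_tokens_per_second(100e9, 10e9)
```

The same formula applies whether the weights sit in VRAM, system RAM, or on an SSD; only the bandwidth term changes.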
Surely the supply of unified memory will rise to meet demand before this is needed.