Hacker News
Submitted by Aurornis 2 hours ago
CuriouslyC 2 hours ago
You can improve that with speculative preload. I'm sure models could be designed and tuned around efficient SSD offloading to keep throughput pretty high.
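A minimal sketch of the preload idea, assuming a model whose weights are read layer by layer: while one layer is being computed, a background thread already fetches the next. The names `load_layer`, `apply_layer`, and `run_with_preload` are hypothetical stand-ins and not any real inference library's API. In this toy the "speculation" is trivial because the next layer is always known; a real system would have to guess which weights come next.

```python
from concurrent.futures import ThreadPoolExecutor


def load_layer(index):
    # Hypothetical stand-in for reading one layer's weights from SSD.
    return [float(index)] * 4


def apply_layer(weights, x):
    # Hypothetical stand-in for the layer's compute step.
    return x + sum(weights)


def run_with_preload(num_layers, x):
    # One background worker overlaps the next read with the current compute.
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(load_layer, 0)
        for i in range(num_layers):
            weights = pending.result()  # blocks only if the read is not done yet
            if i + 1 < num_layers:
                # Start fetching the next layer before computing this one.
                pending = pool.submit(load_layer, i + 1)
            x = apply_layer(weights, x)
    return x
```

Whether this hides the SSD latency depends on the read for layer `i + 1` finishing within the compute time of layer `i`, which this sketch does not measure.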
searealist 1 hour ago
That would apply equally to GPU or RAM inference, since those are also bandwidth-constrained on decode, so people already try to optimize for it.
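The bandwidth constraint can be put as rough arithmetic, assuming each decoded token requires reading all active weights once and that nothing else is the bottleneck. The function name and the figures in the usage note are hypothetical, chosen only to illustrate the ratio.

```python
def decode_tokens_per_second(bandwidth_gb_s, active_weights_gb):
    """Upper bound on decode speed when weight reads are the bottleneck.

    Assumes every token reads all active weights exactly once and
    ignores compute, caching, and batching.
    """
    if bandwidth_gb_s <= 0 or active_weights_gb <= 0:
        raise ValueError("bandwidth and weight size must be positive")
    return bandwidth_gb_s / active_weights_gb
```

For example, `decode_tokens_per_second(100.0, 50.0)` gives an upper bound of 2 tokens per second. The same ratio applies whether the weights stream from VRAM, system RAM, or SSD; only the bandwidth figure changes.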
rsalus 1 hour ago
Surely the supply of unified memory will rise to meet demand before this is needed.