upvote
It would apply equally to GPU or RAM inference as those are also bandwidth constrained on decode, so people already try to optimize for it.
reply
surely the supply of unified memory will rise to meet demand before this is needed
reply