That’s what, 14 GB/s? The GPU’s VRAM can do 100x that.

A discrete consumer GPU card doesn't have enough fast RAM to run a very large model that hasn't been quantized to hell.

That's why all the projects streaming models into the GPU from an SSD popped up recently.

Yes. There’s just no way to get above 1 token/s that way with a large model.
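
A rough back-of-envelope sketch of that bound, assuming a dense model where every weight has to be streamed once per generated token. The 140 GB model size (70B parameters at 2 bytes each) is a hypothetical example, not a figure from the thread; the 14 GB/s and 100x numbers are the ones quoted above.

```python
def max_tokens_per_second(bandwidth_gb_s: float, streamed_gb_per_token: float) -> float:
    """Upper bound on decode speed when each token requires streaming
    `streamed_gb_per_token` gigabytes over a link of `bandwidth_gb_s`."""
    return bandwidth_gb_s / streamed_gb_per_token


# Figures quoted in the thread.
ssd_bandwidth_gb_s = 14.0
vram_bandwidth_gb_s = ssd_bandwidth_gb_s * 100  # "100x that"

# Hypothetical dense model: 70B parameters at 2 bytes per parameter.
# Assumption: all weights are read once per token (no caching, no MoE sparsity).
model_size_gb = 70e9 * 2 / 1e9  # 140 GB

ssd_rate = max_tokens_per_second(ssd_bandwidth_gb_s, model_size_gb)
vram_rate = max_tokens_per_second(vram_bandwidth_gb_s, model_size_gb)

print(f"SSD streaming bound:  {ssd_rate:.2f} tokens/s")
print(f"VRAM bandwidth bound: {vram_rate:.2f} tokens/s")
```

Under those assumptions the SSD path tops out around 0.1 tokens/s. Anything that shrinks the bytes streamed per token (heavier quantization, or a sparse model that only touches some weights) raises the bound proportionally.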