While you can run with the weights in system RAM or even on disk, it gets a lot slower. Only a fraction of the weights is used for any given token, but that fraction can change with each token, so there is a lot of traffic to transfer weights to the GPU, and that is much slower than reading them directly from GPU RAM. Streaming from disk is slower still. Possible, yes, and maybe OK for some purposes, but you might find it painfully slow.
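To put rough numbers on that, here is a back-of-envelope sketch. Everything in it is an assumption for illustration: the 20 GB of active weights per token is hypothetical, the bandwidth figures are round ballpark values rather than measurements, and it takes the worst case where every active weight has to cross the link again for each token (real runtimes cache and overlap transfers, so actual numbers will differ).

```python
def max_tokens_per_second(active_weight_gb: float, bandwidth_gb_per_s: float) -> float:
    """Upper bound on decode speed if each token must move
    `active_weight_gb` of weights over a link with the given bandwidth."""
    if active_weight_gb <= 0 or bandwidth_gb_per_s <= 0:
        raise ValueError("sizes and bandwidths must be positive")
    return bandwidth_gb_per_s / active_weight_gb


# Hypothetical: weights actually touched per token (not from any specific model).
ACTIVE_WEIGHT_GB = 20.0

# Illustrative round numbers in GB/s, assumed rather than measured.
LINKS_GB_PER_S = {
    "gpu_vram": 1000.0,            # weights already resident in GPU memory
    "pcie_from_system_ram": 32.0,  # roughly a PCIe 4.0 x16 link
    "nvme_disk": 5.0,              # a fast consumer NVMe drive
}

estimates = {
    name: max_tokens_per_second(ACTIVE_WEIGHT_GB, bandwidth)
    for name, bandwidth in LINKS_GB_PER_S.items()
}

for name, tokens_per_s in estimates.items():
    print(f"{name}: at most ~{tokens_per_s:.2f} tokens/s")
```

Under those assumptions the ceiling drops from about 50 tokens/s in GPU RAM to under 2 over PCIe and well below 1 from disk, which is the "painfully slow" part.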