> The model weights stay resident in VRAM permanently so there's no loading/unloading per request.

Yes, I was thinking about the context buffers, which I assume are not small for large models. Those have to be loaded into VRAM too, right?
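
For a rough sense of scale, here's a back-of-envelope estimate of the per-request context buffer (KV cache) footprint. The model shape below is a hypothetical 70B-class dense transformer with grouped-query attention; all numbers are illustrative:

```python
# KV cache bytes ~= 2 (K and V) * n_layers * n_kv_heads * head_dim
#                   * seq_len * bytes_per_element

def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    """Approximate KV-cache footprint for a single sequence."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Hypothetical 70B-class model: 80 layers, 8 KV heads (GQA),
# head_dim 128, fp16 cache, 32k-token context.
size = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128,
                      seq_len=32_000, bytes_per_elem=2)
print(f"{size / 2**30:.1f} GiB per request")  # ~9.8 GiB
```

So no, not small at all for long contexts.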

If I keep sending large context buffers, will that hog the batches?

Not if you are the only one. We have rate limits to prevent this in case, idk, you share your key with 1000 people lol.
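
(For the curious, a per-key limiter can be as simple as a token bucket. This is a generic sketch, not our actual implementation; names and numbers are made up.)

```python
import time

class TokenBucket:
    """Minimal token-bucket rate limiter: each key accrues `rate` tokens
    per second up to a burst of `capacity`; a request is admitted only
    if enough tokens remain."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        # Refill tokens based on elapsed time, then try to spend `cost`.
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

# Example: 5 requests/sec steady state, bursts up to 20.
limiter = TokenBucket(rate=5.0, capacity=20.0)
print(limiter.allow())  # True while tokens remain
```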