TTFT is under 2 seconds on average. Worst case is 10-30 s.
Yes, I was thinking about context buffers, which I assume are not small for large models. Those have to be loaded into VRAM, right?
If I keep sending large context buffers, will that hog the batches?
Not the original author, but batching is one very important trick for making inference efficient. You can reasonably run tens to low hundreds of requests in parallel (depending on model size and GPU size) with very little performance overhead.
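To put rough numbers on the context-buffer question, here is a back-of-the-envelope sketch. It assumes the "context buffer" is a standard transformer KV cache stored in fp16, and the model shape (32 layers, 8 KV heads, head dimension 128) and the 40 GiB of free VRAM are made-up illustrative values, not figures for any particular model or server. It also ignores activations, fragmentation, and serving-side optimizations, so treat it as an order-of-magnitude estimate only.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of KV cache one token occupies (factor 2 = one key + one value per layer)."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def max_concurrent_sequences(free_vram_bytes, context_tokens, per_token_bytes):
    """How many sequences of a given context length fit in the VRAM left after weights."""
    return free_vram_bytes // (context_tokens * per_token_bytes)


GIB = 2**30

# Hypothetical model shape, chosen only for illustration.
per_token = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)

# Hypothetical budget: 40 GiB of VRAM left over for the KV cache.
free_vram = 40 * GIB

for context_tokens in (1024, 8192):
    fit = max_concurrent_sequences(free_vram, context_tokens, per_token)
    print(f"{context_tokens} tokens of context -> about {fit} sequences fit")
```

Under these assumptions each token costs 128 KiB of cache, so 1024-token requests fit about 320 at a time while 8192-token requests fit about 40. So yes, longer contexts do eat into how many requests can share a batch, which is consistent with the "tens to low hundreds" range above.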