The size of the KV cache (the stored context) scales with the model's number of layers and its hidden dimension. For a 400B-class model that can mean 30-60GB for just an 8K context window (it depends on the architecture; this is just a ballpark).

So shrinking that by ~6x (down from fp16) would be a big win for larger models. True, TurboQuant can also be applied to model weights, but there it won't save space over q4 compression; it should, however, give better accuracy.
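As a rough back-of-the-envelope sketch (the layer count and hidden dimension below are assumed, illustrative values for a 400B-class dense model, not any specific architecture):

    # KV cache bytes ~ 2 (K and V) * layers * hidden_dim * context_len * bytes_per_value
    def kv_cache_bytes(layers, hidden_dim, context_len, bytes_per_value):
        return 2 * layers * hidden_dim * context_len * bytes_per_value

    layers, hidden_dim, context = 120, 16384, 8192                # assumed hyperparameters
    fp16 = kv_cache_bytes(layers, hidden_dim, context, 2.0)       # 16-bit values
    quant = kv_cache_bytes(layers, hidden_dim, context, 2.0 / 6)  # ~6x smaller

    print(f"fp16 KV cache:      {fp16 / 2**30:.1f} GiB")   # -> 60.0 GiB
    print(f"6x-quantized cache: {quant / 2**30:.1f} GiB")  # -> 10.0 GiB

(Models using grouped-query attention store far fewer KV heads per layer, so real numbers are often lower than this worst case.)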

Edit: better context

reply
That's my hope as well, since I tend to use low-end GPUs (e.g. an NVIDIA GeForce RTX 2060 with 6GB). I've been looking for an image-generation model that fits that card, for use with Ollama plus a GUI on Linux. No luck yet, since money's tight and jobs are tighter :(
reply
An Arc B580 will just about fit Flux.2 Klein (at FP8). However, you can also easily rent much larger GPUs on RunPod or Vast at around $0.25/hr.

I would strongly recommend exploring that option: renting an RTX 5090 for an evening of image generation costs a dollar or two and is way more fun than trying to jam big models onto little cards. Just take some time to create a reasonable, scripted deployment workflow for when you spin up a fresh instance, something like the sketch below.
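A minimal sketch of such a bootstrap script; the package list and model repo id are assumed placeholders, not a specific RunPod or Vast recipe:

    #!/usr/bin/env python3
    """Minimal bootstrap for a fresh rented GPU instance (illustrative only)."""
    import subprocess
    import sys

    # Each step is one command; the packages are assumed examples of a typical
    # diffusion-model stack, and "some-org/some-image-model" is a placeholder.
    STEPS = [
        [sys.executable, "-m", "pip", "install", "--quiet",
         "torch", "diffusers", "transformers", "accelerate"],
        # Pre-download weights so the first image isn't a cold start.
        ["huggingface-cli", "download", "some-org/some-image-model"],
    ]

    for cmd in STEPS:
        print("->", " ".join(cmd))
        subprocess.run(cmd, check=True)  # fail fast if any step errors

Keep it in a git repo and run it right after the instance boots; that way tearing the instance down at the end of the evening costs you nothing.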

reply
hey what's your Venmo?
reply