Thanks! Are the things you're mentioning, like "You may be able to offload some layers to GPU..." and "You can keep the KV cache on GPU...", configured as part of llama.cpp? I wouldn't know what to prompt with or how to evaluate "correctness" (outside of literally feeding your comment into Claude and seeing what happens).
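(For what it's worth, in llama.cpp these are command-line flags rather than anything you prompt for. A sketch, with an illustrative model path and layer count; check `llama-server --help` on your build to confirm the exact flags:)

```shell
# Sketch of llama.cpp GPU-offload flags -- values here are illustrative.
# -ngl / --n-gpu-layers N : number of model layers to offload to the GPU
# --no-kv-offload         : keep the KV cache in system RAM instead of VRAM
#                           (the KV cache sits on the GPU by default when offloading)
llama-server -m ./model.gguf -ngl 32
```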

Aside: what is your tooling setup? Which harness are you using (if any), what's running the inference and where, what runs in WSL vs. Windows, etc.?

I struggle to even ask the right questions about the workflow and environment.

In my case, I was also running an ASR model and a TTS model, so it was a bit much for my RTX 3090. I opted to offload about 5 layers to the CPU while adding GPU-only speculative decoding with their 0.8B model.

Working well so far.
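(Roughly, that setup maps to flags like these -- the flag names come from llama.cpp's speculative-decoding support, but the model paths and layer counts below are illustrative, not my exact command:)

```shell
# Illustrative only: offload most layers, leave ~5 on the CPU, and run a
# small draft model entirely on GPU for speculative decoding.
llama-server \
  -m ./main-model.gguf \
  -ngl 55 \
  -md ./draft-0.8b.gguf \
  -ngld 99
# -ngl 55  : e.g. a 60-layer model with ~5 layers kept on the CPU
# -md      : small draft model used for speculative decoding
# -ngld 99 : offload all draft-model layers so the draft stays GPU-only
```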
