Don't forget that you're also spending much more electricity because it takes so long to run inference.
I have been using Qwen3.5-9B-UD-Q4_K_XL.gguf on an 8GB 3070Ti with llama.cpp server and I get 50-60 tok/s.
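For anyone wanting to reproduce a setup like this, a launch command might look roughly like the sketch below. The GGUF filename is the one named in the comment; the flag values (`-ngl 99`, the context size, the port) are illustrative assumptions, not the commenter's exact invocation:

```shell
# Hypothetical sketch, assuming llama.cpp's llama-server binary and the
# GGUF file from the comment; flag values are illustrative, not the
# commenter's exact command.
#
# -ngl 99  offloads all layers to the GPU (a Q4_K quant of a ~9B model
#          fits in 8 GB of VRAM with some room left for the KV cache)
# -c 8192  context size; reduce it if the KV cache pushes past VRAM
llama-server -m Qwen3.5-9B-UD-Q4_K_XL.gguf -ngl 99 -c 8192 --port 8080
```

Once running, the server exposes an OpenAI-compatible API on the given port.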