I set this up today on my 5090 at Q6_K quantization with a Q4_0 KV cache and got a consistent 50 tokens/s at 123k context, using ~28 of 32 GB of VRAM through LM Studio.
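For a back-of-the-envelope sense of why the Q4_0 KV cache matters at that context length, here's a rough sizing sketch. The layer/head/dim numbers are made-up placeholders, not the actual model's config, and the q4_0 figure ignores block-scale overhead:

```python
# Rough KV-cache size estimate. Layer/head numbers below are
# hypothetical placeholders, not any particular model's config.
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elt):
    # K and V each hold ctx_len vectors of n_kv_heads * head_dim per layer
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elt

CTX = 123_000
f16 = kv_cache_bytes(48, 8, 128, CTX, 2)    # f16: 2 bytes/element
q4  = kv_cache_bytes(48, 8, 128, CTX, 0.5)  # q4_0: ~0.5 bytes/element
print(f"f16 KV: {f16 / 2**30:.1f} GiB, q4_0 KV: {q4 / 2**30:.1f} GiB")
```

With these placeholder numbers the cache alone goes from roughly 22.5 GiB at f16 down to about 5.6 GiB at q4_0, which is the difference between fitting 123k context in 32 GB or not.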
Wow, that sounds usable. I know it's anecdotal, but how did you find the quality of the output, and can you compare it to any closed-source model?
Not that you asked, but I'm getting ~20 tokens/s on my DGX Spark (Asus, actually) using an Int4 AutoRound quant, MTP 1, and some other tricks.
Can't answer for an RTX 5090, but on a desktop RTX 5080 with 16 GB of VRAM, I get about 6 tokens/sec after some tweaking (f16 -> q4_0). Kind of on the borderline of usable... you'd realistically need either a 5090 with more VRAM or something like a Mac with a unified memory architecture.
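For anyone doing the same tweaking outside a GUI, something like the following llama.cpp server invocation should approximate the f16 -> q4_0 KV setup (the model filename is a placeholder; flag names are per recent llama.cpp builds):

```shell
# A sketch, not a verified config. Flags:
#   -c         context length
#   -ngl       layers to offload to the GPU
#   -ctk/-ctv  K/V cache quantization type
#   -fa        flash attention (needed for a quantized V cache)
./llama-server -m ./model-Q6_K.gguf -c 123000 -ngl 99 \
  -ctk q4_0 -ctv q4_0 -fa
```

Dropping the context length (`-c`) is the other easy lever if you're just over the VRAM budget.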
My M5 Pro is getting ~11 tokens per second via OMLX for an 8-bit quant.
A Mac is not going to be all that much faster than a 5080 on any model, other than the ones you currently can't run at all because you don't have enough GPU and CPU memory combined.

You’re much better off adding a second GPU if you’ve already got a PC you’re using.
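If you do add a second GPU, llama.cpp can split a model across cards. A hypothetical two-card setup (the model filename is a placeholder; `-ts` takes split ratios, so 16,24 here stands in for a 16 GB card plus a 24 GB card):

```shell
# Sketch of a two-GPU split in llama.cpp:
#   --split-mode layer  puts whole layers on each device
#   -ts 16,24           proportion of the model per GPU
./llama-server -m ./model-Q6_K.gguf -ngl 99 \
  --split-mode layer -ts 16,24
```

The cards don't need to match; mismatched VRAM just means an uneven split ratio.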
