I set this up today on my 5090 at Q6_K quantization with a Q4_0 KV cache and got a consistent 50 tokens/s at 123k context, using ~28 of 32 GB of VRAM through LM Studio.
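For a back-of-the-envelope sense of why the Q4_0 KV cache matters at that context length, here's a rough sizing sketch. The layer/head/dim numbers are made-up placeholders, not the actual model's config, and the q4_0 figure ignores block-scale overhead:

```python
# Rough KV-cache size estimate. Layer/head numbers below are
# hypothetical placeholders, not any particular model's config.
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elt):
    # K and V each hold ctx_len vectors of n_kv_heads * head_dim per layer
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elt

CTX = 123_000
f16 = kv_cache_bytes(48, 8, 128, CTX, 2)    # f16: 2 bytes/element
q4  = kv_cache_bytes(48, 8, 128, CTX, 0.5)  # q4_0: ~0.5 bytes/element
print(f"f16 KV: {f16 / 2**30:.1f} GiB, q4_0 KV: {q4 / 2**30:.1f} GiB")
```

With these placeholder numbers the cache alone goes from roughly 22.5 GiB at f16 down to about 5.6 GiB at q4_0, which is the difference between fitting 123k context in 32 GB or not.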
Wow, that sounds usable. I know it's anecdotal, but how did you find the quality of the output, and can you compare it to any closed-source model?
Not that you asked, but I'm getting ~20 tokens/s on my DGX Spark (Asus, actually) using an Int4 AutoRound quant, MTP 1, and some other tricks.
Can't answer for an RTX 5090, but on a desktop RTX 5080 with 16 GB of VRAM, I get about 6 tokens/sec after some tweaking (f16 -> q4_0). Kind of on the borderline of usable... you'd realistically need either a 5090 with more VRAM or something like a Mac with a unified memory architecture.
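For anyone doing the same tweaking outside a GUI, something like the following llama.cpp server invocation should approximate the f16 -> q4_0 KV setup (the model filename is a placeholder; flag names are per recent llama.cpp builds):

```shell
# A sketch, not a verified config. Flags:
#   -c         context length
#   -ngl       layers to offload to the GPU
#   -ctk/-ctv  K/V cache quantization type
#   -fa        flash attention (needed for a quantized V cache)
./llama-server -m ./model-Q6_K.gguf -c 123000 -ngl 99 \
  -ctk q4_0 -ctv q4_0 -fa
```

Dropping the context length (`-c`) is the other easy lever if you're just over the VRAM budget.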
My M5 Pro is getting ~11 tokens per second via OMLX for an 8-bit quant.
A Mac is not going to be all that much faster than a 5080 on any model, other than the ones you currently can't run at all because you don't have enough GPU and CPU memory combined.

You’re much better off adding a second GPU if you’ve already got a PC you’re using.
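If you do add a second GPU, llama.cpp can split a model across cards. A hypothetical two-card setup (the model filename is a placeholder; `-ts` takes split ratios, so 16,24 here stands in for a 16 GB card plus a 24 GB card):

```shell
# Sketch of a two-GPU split in llama.cpp:
#   --split-mode layer  puts whole layers on each device
#   -ts 16,24           proportion of the model per GPU
./llama-server -m ./model-Q6_K.gguf -ngl 99 \
  --split-mode layer -ts 16,24
```

The cards don't need to match; mismatched VRAM just means an uneven split ratio.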
