You can use system RAM with something like llama.cpp, which can run the model out of system RAM. Token generation speed is a function of memory bandwidth: the more bandwidth, the better. I'm on 8 channels at 2400 MT/s. With 12 channels at the same 2400 MT/s, I would get 1.5x the speed. DDR5 is much faster, of course, so 12 channels at 4800 MT/s would give 3x the token generation speed, or roughly 18 tk/sec. Prompt processing is all about compute, so the more CPU cores you have, the faster PP goes.
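A rough sketch of that arithmetic, assuming each memory channel is 64 bits (8 bytes) wide and that token generation scales linearly with theoretical peak bandwidth. The function names are made up for illustration, and the 6 tk/sec baseline is only what the 3x and 18 tk/sec figures imply, not a measured number:

```python
def peak_bandwidth_gbs(channels: int, mega_transfers: int, bytes_per_transfer: int = 8) -> float:
    """Theoretical peak memory bandwidth in GB/s.

    Assumes a 64-bit (8-byte) channel; sustained bandwidth is lower in practice.
    """
    return channels * mega_transfers * bytes_per_transfer / 1000


def scaled_tokens_per_sec(baseline_tps: float, baseline_bw: float, new_bw: float) -> float:
    """Estimate token generation speed, assuming it scales linearly with bandwidth."""
    return baseline_tps * new_bw / baseline_bw


current = peak_bandwidth_gbs(8, 2400)       # 8 channels at 2400 MT/s
wider = peak_bandwidth_gbs(12, 2400)        # 12 channels, same speed
wider_ddr5 = peak_bandwidth_gbs(12, 4800)   # 12 channels at 4800 MT/s

baseline_tps = 6.0  # assumption: implied by "3x ... roughly 18 tk/sec"

for label, bw in [("8ch @ 2400", current), ("12ch @ 2400", wider), ("12ch @ 4800", wider_ddr5)]:
    tps = scaled_tokens_per_sec(baseline_tps, current, bw)
    print(f"{label}: {bw:.1f} GB/s, {bw / current:.2f}x, ~{tps:.0f} tk/sec")
```

This only covers token generation. Prompt processing is compute-bound, so it won't follow the same scaling.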

by nijave 6 hours ago

Well, it's about GPU VRAM if you want something competitive with cloud-hosted offerings at the performance levels shown in benchmarks. This is a heavy quant with quality degradation and significantly lower performance.

Cloud offerings run at 80-200 tk/sec versus single-digit tk/sec here.

That said, I'm also surprised it runs at all locally. I do think it'd be painfully slow for anything interactive, so you're either relying on another model for a comprehensive design or hoping a one-shot with somewhat degraded quality turns out correctly.

by edg5000 6 hours ago

I see. So not quite usable apart from specific use cases. Maybe in a few years we'll see new hardware players and better prices.