upvote
you can use system ram with a system like llama.cpp which offloads to system ram. token generation is a function of system bandwidth, the faster the bandwidth the better. so I'm on 8 channel 2400mhz. if I had a 12 ddr channel, I would get 1.5x the speed at 2400mhz. of course ddr5 is much faster, so a 12 ddr at 4800mhz will provide 3x the speed for token generation or roughly 18tk/sec. prompt processing is all about compute, so the more cpu cores you have the faster it can do PP.
reply
Well, it's about GPU VRAM if you want something competitive with cloud-hosted offerings at the performance levels showing in benchmarks. This is a heavy quant with quality degradation and significantly lower performance.

Cloud offerings are 80-200tk/sec versus single digit tk/sec.

That said, I'm also surprised it runs at all locally. I do think it'd be painfully slow for anything interactive so you're relying on another model for a comprehensive design or you're hoping a one-shot with somewhat degraded quality turns out correctly.

reply
I see. So not quite usable apart for specific use cases. Maybe in a few years we'll see new hardware players and better prices.
reply