Thanks for posting this; it's how I first found out about Dan's experiment! SSD speed doubled in the M5 Pro/Max generation, which makes this usable! One paper that I think has flown under the radar is "KV Prediction for Improved Time to First Token" (https://arxiv.org/abs/2410.08391), which will hopefully help with prefill for Flash streaming.
That’s exactly what I was thinking. I'm getting my hands on an M5 Max this week and am going to see how Dan’s experiment performs with faster I/O. I'm also going to experiment with running the active parameters at Q6 or Q8: since output is I/O bottlenecked, there should be room for higher-accuracy compute.
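For anyone who wants to sanity-check the I/O side before trying this, here's a rough back-of-the-envelope sketch. It assumes every streamed weight is read from SSD once per token with no caching, and it uses nominal bit widths (ignoring per-block scale overhead). The parameter count and bandwidth below are made-up placeholders, not measurements from Dan's experiment or any real machine.

```python
def max_tokens_per_second(streamed_params: float,
                          bits_per_weight: float,
                          ssd_gb_per_s: float) -> float:
    """Upper bound on tokens/s when SSD reads are the only bottleneck.

    Assumes each streamed weight is read exactly once per token (no cache hits).
    """
    bytes_per_token = streamed_params * bits_per_weight / 8
    return ssd_gb_per_s * 1e9 / bytes_per_token


# Hypothetical placeholders: 1e9 streamed parameters per token, 4 GB/s SSD reads.
STREAMED_PARAMS = 1e9
SSD_GB_PER_S = 4.0

for name, bits in (("Q4", 4), ("Q6", 6), ("Q8", 8)):
    bound = max_tokens_per_second(STREAMED_PARAMS, bits, SSD_GB_PER_S)
    print(f"{name}: at most {bound:.2f} tokens/s")
```

Under these assumptions the bound scales inversely with bits per weight, so it's worth measuring how much of the faster SSD's headroom the wider quants use up.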
To be fair, it's "possible" to run such a setup with llama.cpp and SSD offload. The TG speeds are just abysmal. But it's possible.
Check my repo; I added some support for GGUF/Unsloth quants (Q3, Q5/Q8): https://github.com/Anemll/flash-moe/blob/iOS-App/docs/gguf-h...
That was a very good summary. One detail the post could add is that the 4 or 10 experts invoked were selected from the 512 experts the model has per layer (to give an idea of the savings).
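To put numbers on that, here's a quick sketch of the per-layer arithmetic. It assumes all 512 experts in a layer are the same size, so the fraction of experts invoked equals the fraction of expert weights touched; the function name is just for illustration.

```python
def expert_fraction(active: int, total: int = 512) -> float:
    """Fraction of a layer's expert weights touched when `active` of `total` experts run."""
    return active / total


for k in (4, 10):
    frac = expert_fraction(k)
    print(f"{k} of 512 experts: {frac:.2%} of the layer's expert weights, "
          f"{1 / frac:.1f}x less than loading all of them")
```

So 4 experts is 1/128 of a layer's expert weights, and 10 experts is a bit under 2%.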