by vicchenai 4 hours ago

by p_ing 3 hours ago

4K random read at a queue depth of 1 on an M1 Max is about 65 MB/s.

by tatef 3 hours ago

Yes, definitely agree. It's more of a POC than a functional use case. However, for many smaller MoE models this method can actually be useful, achieving multiple tokens/sec.

by zozbot234 4 hours ago

> for a 1T model youd need to stream something like 2TB of weights per forward pass

Isn't this missing the point of MoE models completely? MoE inference is sparse: you only read a small fraction of the weights per layer. You still have the problem that each individual expert-layer is quite small (a few MiB each, give or take), but those reads are large enough for NVMe.
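
To put rough numbers on the sparse-read argument, here is a back-of-envelope sketch. The layer count, experts per token, expert size, and NVMe bandwidth in the usage comment are all made-up assumptions for illustration, not figures from any particular model or drive, and the estimate ignores caching, attention weights, and compute time.

```python
MIB = 2**20

def bytes_per_token(n_layers: int, experts_per_token: int, expert_bytes: int) -> int:
    """Expert weight bytes read for one token when only the routed
    experts in each layer are fetched from storage (nothing cached)."""
    return n_layers * experts_per_token * expert_bytes

def tokens_per_sec(bandwidth_bytes_per_sec: float, n_layers: int,
                   experts_per_token: int, expert_bytes: int) -> float:
    """Upper bound on decode speed if storage bandwidth is the only limit."""
    return bandwidth_bytes_per_sec / bytes_per_token(
        n_layers, experts_per_token, expert_bytes)

# Hypothetical configuration: 60 layers, 8 routed experts per token,
# 4 MiB per expert-layer, 3 GB/s sustained NVMe reads.
per_token = bytes_per_token(60, 8, 4 * MIB)
rate = tokens_per_sec(3e9, 60, 8, 4 * MIB)
print(f"{per_token / MIB:.0f} MiB per token, about {rate:.2f} tokens/sec")
```

With these assumed numbers the per-token read is a couple of GiB rather than the full model, which is the gap between the dense and sparse estimates.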

by visarga 4 hours ago

But across a sequence you still have to load most of them.
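
A quick way to see how fast that happens, as a sketch only: assume each token independently picks k of E experts per layer uniformly at random. That is an assumption for illustration; real routers are not uniform, and the k and E values below are hypothetical.

```python
def expected_fraction_touched(n_tokens: int, k: int, n_experts: int) -> float:
    """Expected fraction of one layer's experts that get loaded at least
    once over n_tokens, under uniform independent top-k routing."""
    # A given expert is skipped by one token with probability 1 - k/E,
    # so it is skipped by all n_tokens with probability (1 - k/E) ** n_tokens.
    p_miss_once = 1.0 - k / n_experts
    return 1.0 - p_miss_once ** n_tokens

# Hypothetical layer: 64 experts, 8 routed per token.
for t in (1, 8, 32, 128):
    print(t, round(expected_fraction_touched(t, 8, 64), 3))
```

Under these assumptions a single token touches one eighth of the layer's experts, but a few dozen tokens already cover nearly all of them, so the savings from sparsity mostly apply per token rather than per sequence.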