I’ve been doing some low-key testing on smaller models, and it looks to me like it’s possible to train an MoE model with characteristics that are helpful for streaming. For instance, you could add a loss term that penalizes expert swapping, both within a single forward pass and across multiple forward passes. So I believe there is a place for thinking about this on the model training side.
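To make that concrete, here is a rough sketch of what such a penalty might look like. The comment doesn't specify the loss, so this form is an assumption: it takes per-token, per-layer router probabilities and penalizes low overlap between consecutive routing decisions. The name `swap_penalty` and the nested-list input are hypothetical. In real training this would be written on tensors in your framework and added to the main loss with some weight. The within-pass term compares adjacent layers by expert index, which only makes sense if same-index experts are stored or loaded together, and that too is an assumption.

```python
def swap_penalty(router_probs):
    """Return (across_tokens, within_pass) swap penalties, each in [0, 1].

    router_probs[t][l] is the list of expert probabilities the router
    assigns for token t at layer l (hypothetical input layout).
    """
    def overlap(a, b):
        # Dot product of two routing distributions: 1.0 when both put all
        # their mass on the same expert, 0.0 when they share no expert.
        return sum(x * y for x, y in zip(a, b))

    def mean(values):
        return sum(values) / len(values) if values else 0.0

    num_tokens = len(router_probs)
    num_layers = len(router_probs[0])

    # Across forward passes: same layer, consecutive tokens.
    across_tokens = [
        1.0 - overlap(router_probs[t][l], router_probs[t - 1][l])
        for t in range(1, num_tokens)
        for l in range(num_layers)
    ]
    # Within one forward pass: same token, adjacent layers (assumption:
    # same-index experts in neighbouring layers are worth keeping together).
    within_pass = [
        1.0 - overlap(router_probs[t][l], router_probs[t][l - 1])
        for t in range(num_tokens)
        for l in range(1, num_layers)
    ]
    return mean(across_tokens), mean(within_pass)


# Two tokens, two layers, two experts: every layer flips expert between tokens.
flip = [
    [[1.0, 0.0], [1.0, 0.0]],
    [[0.0, 1.0], [0.0, 1.0]],
]
across, within = swap_penalty(flip)
```

Here `across` comes out at its maximum because every layer changes expert from one token to the next, while `within` is zero because each token uses the same expert index in both layers.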
Penalizing expert swaps doesn't seem like it would help much, because experts vary by layer and are picked layer-wise. There's no guarantee that expert X in layer Y, used for the previous token, will still be resident in memory when this token reaches layer Y. The optimum would also vary with how much memory you have at any given moment. It's not obviously worth optimizing for.
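A toy simulation illustrates the point. It assumes a simple LRU cache keyed by (layer, expert), which is a stand-in I'm using for illustration and not how any particular runtime manages expert weights. Every token picks expert 0 in every layer, so there are zero swaps across tokens, yet the hit rate still depends entirely on whether the cache can hold one expert per layer.

```python
from collections import OrderedDict


def hit_rate(accesses, capacity):
    """Fraction of accesses served from an LRU cache with `capacity` slots."""
    cache = OrderedDict()
    hits = 0
    for key in accesses:
        if key in cache:
            hits += 1
            cache.move_to_end(key)  # mark as most recently used
        else:
            cache[key] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used
    return hits / len(accesses)


num_layers, num_tokens = 4, 5
# Each token walks the layers in order and always routes to expert 0.
accesses = [(layer, 0) for _ in range(num_tokens) for layer in range(num_layers)]

small = hit_rate(accesses, capacity=2)  # fewer slots than layers
large = hit_rate(accesses, capacity=4)  # one slot per layer
```

With two slots, each layer's expert is evicted before the next token comes back to it, so `small` is 0.0 despite perfectly stable routing. With four slots, only the first token misses and `large` is 0.8.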