Penalizing expert swaps doesn't seem like it would help much, because each layer has its own set of experts and routing happens layer by layer. There's no guarantee that expert X in layer Y, which the previous token used, will still be resident in memory when this token reaches layer Y: the loads for the layers in between may already have evicted it. The best trade-off would also shift with how much memory is free at any given moment. It's not obviously worth optimizing for.
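To make that concrete, here's a rough toy simulation, not a model of any real runtime. It assumes a single LRU cache shared across layers and a router that picks top-k experts uniformly at random per layer; `simulate` and all of its parameters are made up for illustration, and real routers are likely more correlated than this.

```python
import random
from collections import OrderedDict


def simulate(num_layers=4, num_experts=8, top_k=2,
             cache_slots=8, num_tokens=200, seed=0):
    """Return the fraction of expert loads served from a shared LRU cache."""
    rng = random.Random(seed)
    cache = OrderedDict()  # keys are (layer, expert) pairs, oldest first
    hits = total = 0
    for _ in range(num_tokens):
        for layer in range(num_layers):
            # Assumption: uniform random top-k routing, independent per layer.
            for expert in rng.sample(range(num_experts), top_k):
                key = (layer, expert)
                total += 1
                if key in cache:
                    hits += 1
                    cache.move_to_end(key)  # mark as most recently used
                else:
                    cache[key] = True
                    if len(cache) > cache_slots:
                        cache.popitem(last=False)  # evict least recently used
    return hits / total


if __name__ == "__main__":
    for slots in (2, 8, 16, 32):
        print(slots, round(simulate(cache_slots=slots), 3))
```

With `cache_slots=2` every load misses in this setup, since the other layers' loads always push out layer Y's experts before the next token gets back to layer Y. Once the cache can hold every (layer, expert) pair, almost everything hits. In between, the hit rate is driven by the memory budget, which is the point about the optimum moving around.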