That's not really how the experts in an MoE work. The router picks experts per token, based on the probabilities it assigns to each expert, so routing happens on every token. You don't necessarily have a discrete math expert and a discrete physics expert. And even if you did, you would still need a router that is trained on all of those domains.
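
To make the per-token point concrete, here is a minimal sketch of top-k routing in plain Python. The logits, the expert count (4), and the names `softmax` and `route_token` are made up for illustration and are not taken from any particular model; real routers are learned linear layers over hidden states.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(router_logits, k=2):
    """Pick the k most probable experts for ONE token.

    Returns a list of (expert_index, weight) pairs, with the weights
    renormalized so they sum to 1 over the selected experts.
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

# Hypothetical router logits for three consecutive tokens (4 experts).
token_logits = [
    [2.0, 0.1, -1.0, 0.5],
    [-0.5, 1.5, 1.4, 0.0],
    [0.2, 0.1, 3.0, -2.0],
]

# Routing is decided independently for every token.
routes = [route_token(logits, k=2) for logits in token_logits]
for t, route in enumerate(routes):
    print(t, [(i, round(w, 2)) for i, w in route])
```

Each token in the same sequence ends up on a different pair of experts, which is why "the math expert" is not a unit you can cleanly carve out.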

by yorwba 7 hours ago

MoE models are typically designed for datacenter deployment, where per-token load balancing is more important, but it's also possible to use a different training objective that encourages domain specialization of experts (see https://allenai.org/blog/emo). But yes, this isn't really useful for distributed training as such, because of the router.
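
For reference, a rough sketch of one common load-balancing auxiliary loss, in the style used by Switch Transformer: the number of experts times the sum over experts of (fraction of tokens routed to that expert) times (mean router probability for that expert). This is an illustration of the load-balancing objective only, not the objective from the linked post; the function names and the toy logits are assumptions.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def load_balance_loss(batch_logits):
    """Switch-style auxiliary loss for top-1 routing over a batch of tokens.

    batch_logits: one list of router logits per token.
    Equals 1.0 when tokens are spread evenly and grows toward the number
    of experts as routing collapses onto a single expert.
    """
    n_tokens = len(batch_logits)
    n_experts = len(batch_logits[0])
    probs = [softmax(logits) for logits in batch_logits]

    # f[i]: fraction of tokens whose top-1 expert is i.
    counts = [0] * n_experts
    for p in probs:
        counts[max(range(n_experts), key=lambda i: p[i])] += 1
    f = [c / n_tokens for c in counts]

    # P[i]: mean router probability assigned to expert i.
    P = [sum(p[i] for p in probs) / n_tokens for i in range(n_experts)]

    return n_experts * sum(fi * pi for fi, pi in zip(f, P))

# Toy batches (hypothetical logits, 4 experts, 4 tokens).
balanced = [[10.0 if i == t else 0.0 for i in range(4)] for t in range(4)]
collapsed = [[10.0, 0.0, 0.0, 0.0] for _ in range(4)]

balanced_loss = load_balance_loss(balanced)
collapsed_loss = load_balance_loss(collapsed)
print(balanced_loss, collapsed_loss)
```

Minimizing a term like this pushes toward even per-token utilization, which pulls in the opposite direction from letting whole domains settle onto a few experts.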