MoE models are typically designed for datacenter deployment, where balancing the per-token load across experts is the priority, but it's also possible to use a different training objective that encourages experts to specialize by domain:
https://allenai.org/blog/emo

But yes, this isn't really useful for distributed training as such because of the router.
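
For anyone unfamiliar with what "per-token load-balancing" means in practice, here's a rough sketch of the auxiliary loss commonly used for it (the Switch Transformer-style formulation with top-1 routing). This is not the objective from the linked post; the function names and toy logits are made up for illustration. The loss is smallest when tokens are spread evenly across experts, which is exactly the pressure that works against domain specialization.

```python
import math

def softmax(logits):
    # Numerically stable softmax over one token's router logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def load_balancing_loss(router_logits):
    """Switch-style auxiliary loss: num_experts * sum_i f_i * P_i.

    router_logits: one list of per-expert logits for each token.
    f_i is the fraction of tokens whose top-1 expert is i,
    P_i is the mean router probability assigned to expert i.
    """
    num_tokens = len(router_logits)
    num_experts = len(router_logits[0])
    probs = [softmax(row) for row in router_logits]

    # f[i]: fraction of tokens dispatched (top-1) to expert i.
    f = [0.0] * num_experts
    for p in probs:
        f[p.index(max(p))] += 1.0 / num_tokens

    # P[i]: average probability mass the router gives expert i.
    P = [sum(p[i] for p in probs) / num_tokens for i in range(num_experts)]

    return num_experts * sum(fi * Pi for fi, Pi in zip(f, P))

# Hypothetical toy inputs: 4 tokens, 2 experts.
balanced = [[4.0, 0.0], [0.0, 4.0], [4.0, 0.0], [0.0, 4.0]]
collapsed = [[4.0, 0.0]] * 4  # every token prefers expert 0

balanced_loss = load_balancing_loss(balanced)
collapsed_loss = load_balancing_loss(collapsed)
```

With these toy inputs the evenly split batch scores lower than the batch where every token goes to one expert, so minimizing this term pushes the router toward uniform usage regardless of what domain the tokens come from.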