upvote
Mixture-of-Expert (MoE) was introduced in the 1990s [1, 2], see also [3, 4]. The idea was that MoE scales up model capacity and only introduces small computation overhead. MoEs did not become viable for high-performance applications until sparse routing was integrated with modern deep networks, made possible by large-scale distributed computation. The breakthrough came with the development of sparsely gated networks [5], which showed that it is possible to maintain model accuracy while activating only a small fraction of a large parameter network during both training and inference.

[1] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, G. E. Hinton, Adaptive mixtures of local experts. (1991)

[2] M. I. Jordan, R. A. Jacobs, Hierarchical mixtures of experts and the EM algorithm. (1993)

[3] L. Xu, M. Jordan, G. E. Hinton, An alternative model for mixtures of experts. (1994)

[4] S. Waterhouse, D. MacKay, A. Robinson, Bayesian methods for mixtures of experts. (1995)

[5] N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, J. Dean, Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. (2017)

reply
Yes - I meant as applied to LLMs/Transformers.
reply