undefined

points

[-]

The important thing about MoEs which I mention in the conclusion is that they carry fewer (way fewer) active tokens during inference/generation.

35B-A3B is what we started out with in the days of only having the 3090, but the quality is not as good, and the speed from the cards we have now can blaze at 130-200 tokens per second of generation with q5 and a full context in fp16.

Not to say that MoEs don't have their place. For people running on unified RAM, they're sometimes the only viable option due to the slowness of dense models.

Why is a dense model slower? All model weights have to be loaded and exercised. Passing through 27B vs 3B (active) is maths. So yes you will always get more tokens per second of generation.

You must (just as we did) evaluate on your own products and daily work. If the MoE gives the results you need with only 3B parameters then you have your answer.

Not prescriptive at all. This is experience based, from the trenches of a actual software business so hopefully a different perspective for folks than "Ran Qwen on my macbook, generated a great python script for me"