The rule of thumb is that a model like this performs about as well as a dense model whose parameter count is the geometric mean of its total and active parameter counts. So this model (35B total, 3B active) should be roughly equivalent to a dense model with about 10.25 billion parameters.
If you have the VRAM to spare, a model with more total params but fewer active ones can be a very worthwhile tradeoff. Of course, that's a big if.
> Sorry, how did you calculate the 10.25B?
The geometric mean of two numbers is the square root of their product. Square root of 105 (35*3) is ~10.25.
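If you want to check the arithmetic yourself, here's a quick sketch. The function name `dense_equivalent_params` is made up for illustration, and the geometric-mean rule is only a rough heuristic, not an exact law:

```python
import math


def dense_equivalent_params(total_b: float, active_b: float) -> float:
    """Geometric mean of total and active parameter counts (in billions).

    Hypothetical helper: this just encodes the rule of thumb from the thread.
    """
    return math.sqrt(total_b * active_b)


# 35B total, 3B active, as in the example above
estimate = dense_equivalent_params(35, 3)
print(f"{estimate:.2f}B")  # prints 10.25B
```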
Never mind, the other reply clears it up.