upvote
No real change in inference speed. It basically just allows me to slot in more context or a bigger model.

A single RTX-3090 will do approximately the same tok/s, but it won't fit the entire 300k context in VRAM.

Sometimes that matters, a lot of times it doesn't.

On the speed front - MOE models are great. Biggest perf difference in modern models is the move to MOE architectures.

I get very similar quality from the both the Gemma-4 31B dense model, and the Gemma-4 26B MOE model (both at Q4 quant) but the MOE version runs at ~3 times the speed (150tok/s vs 46tok/s).

reply
You’re adding extra gpu for more vram, not speed.
reply