The decision is always a trade-off between how good we can make the models from a technical standpoint and how good they need to be to make all of you super excited to use them. And it's a bit of a challenge in an ever-changing ecosystem.
I'm personally curious: is there a certain parameter size you're looking for?
I would personally love to see a super sparse 200B A3B model, just to see what is possible. These machines don't have a lot of bandwidth, so a low active count is essential to getting good speed, and a high total parameter count gives the model greater capability and knowledge.
It would also be essential to have the Q4 QAT, of course. Then the 200B model weights would take up ~100GB of memory, not including the context.
The common 120B size these days leaves a lot of unused memory on the table on these machines.
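To make the memory claim concrete, here's a back-of-the-envelope sketch (a toy calculation of weight storage only, ignoring KV cache, activations, and file-format overhead):

```python
def weight_memory_gb(total_params_b: float, bits_per_param: float) -> float:
    """Approximate weight memory in GB: params * bits / 8, decimal GB."""
    return total_params_b * 1e9 * bits_per_param / 8 / 1e9

# A 200B model at Q4 (~4 bits/param) vs the same model at BF16.
print(weight_memory_gb(200, 4))   # -> 100.0 GB, fits in 128GB with room for context
print(weight_memory_gb(200, 16))  # -> 400.0 GB, hopeless on these machines
```

This is why the Q4 QAT release matters: without it, the same total parameter count is simply out of reach on a 128GB box.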
I would also like the larger models to support audio input, not just the E2B/E4B models. And audio output would be great too!
That seems pointless. You can achieve that with a single 24G graphics card already.
I wonder if it would even hold up at that level, as 3B active is really not a lot to work with. Qwen 3.5 uses 122B-A10B and still is neck and neck with the 27B dense model.
I don't see any value proposition for these little boxes like DGX Spark and Strix Halo. Lots of too-slow RAM to do anything useful except run mergekit. imo you'd have been better building a desktop computer with two 3090s.
For the sake of argument, even if we take that old rule of thumb at face value, you can see how the MoE still wins:
- (DGX Spark) 273GB/s of memory bandwidth with 3B active parameters at Q4 = 273 / 1.5 = 182 tokens per second as the theoretical maximum.
- (RTX 3090) 936GB/s with 24B parameters at Q4 = 936 / 12 = 78 tokens per second. Or 39 tokens per second if you wanted to run at Q8.
The "slow" DGX Spark is now more than twice as fast as the RTX 3090, thanks to an appropriate MoE architecture. Even with two RTX 3090s, you would still be slower. Yes, an RTX 5090 would close that gap significantly, but you mentioned RTX 3090s, and I also have an RTX 3090-based AI desktop.
(The above calculation is dramatically oversimplified, but the idea holds. Token generation is bandwidth limited.)
The mid-size frontier models are rumored to be extremely sparse like that, but 10x larger on both total and active. No one has ever released an open model that sparse for us to try out.
As I said, I wanted to see what is possible for Google to achieve.
> Qwen 3.5 uses 122B-A10B and still is neck and neck with the 27B dense model.
From what I've seen, having used both, I would anecdotally report that the 122B model is better in ways that aren't reflected in benchmarks, with more inherent knowledge and more adaptability. This is arguably why MoEs are so popular in the first place. Otherwise the AI labs would all be making a smaller dense model of the same "intelligence". No one believes any of the frontier models are dense models.
gpt-oss-120b is over 600 tokens/s PP for all but one backend.
nemotron-3-super is at best 260 tokens/s PP.
Comparing token generation, it's again roughly 50 tokens/sec vs. 15 tokens/sec.
That really bogs down agentic tooling. Something needs to be categorically better to justify halving output speed, not just playing in the margins.
gpt-oss-120b: (unknown prefill), ~175 tok/s generation. I don't remember the prefill speed, but it was certainly below 10k.
Nemotron-3-Super: 14070 tok/s prefill, ~194.5 tok/s generation. (Tested fresh after reload, no caching, I have a screenshot.)
Nemotron-3-Super using NVFP4 and speculative decoding via MTP 5 tokens at a time as mentioned in Nvidia cookbook: https://docs.nvidia.com/nemotron/nightly/usage-cookbook/Nemo...
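For anyone unfamiliar with why drafting several tokens at a time helps: in speculative decoding, a cheap draft head proposes a few tokens, the main model verifies all of them in a single batched forward pass, and the longest agreeing prefix is kept for free. A toy greedy illustration of the acceptance rule (not Nvidia's implementation; `target_tokens` stands in for the main model's one-pass verification):

```python
def accept_draft(draft_tokens, target_tokens):
    """Toy greedy speculative-decoding acceptance: keep the longest prefix
    where the draft matches the target, plus the target's correction at the
    first mismatch. Every accepted draft token is one decode step saved."""
    out = []
    for d, t in zip(draft_tokens, target_tokens):
        if d == t:
            out.append(d)   # draft agreed: token accepted for free
        else:
            out.append(t)   # first disagreement: take the target's token, stop
            break
    return out

# Draft of 5 tokens; the target agrees on the first 3.
print(accept_draft([10, 11, 12, 99, 13], [10, 11, 12, 42, 7]))
# -> [10, 11, 12, 42]
```

When the draft head agrees often (as MTP heads trained alongside the model tend to), you emit several tokens per main-model pass, which is how the generation numbers above stay high despite the larger model.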
Nemotron-3-Super is, without question, my favorite model now for my agentic use cases. The closest model I would compare it to, in vibe and feel, is the Qwen family, but this thing has an ability to hold attention through complicated (often noisy) agentic environments, and I sometimes find myself checking that I'm not on a frontier model.
I now just rent a Dual B6000 on a full-time basis for myself for all my stuff; this is the backbone of my "base" agentic workload, and I only step up to stronger models in rare situations in my pipelines.
The biggest thing with this model, I've found, is just making sure my environment is set up correctly; the temps and templates need to be exactly right. I've had hit-or-miss results with OpenRouter. But running this model on a B6000 from Vast with the native NVFP4 weights from Nvidia, it's really good: ~2,500 peak tokens/sec on that setup with batching, and about 100 tok/s for a single request at 250k context. :)
I can run it on a single B6000 up to about 120k context reliably, but this thing really SCREAMS on a dual B6000. (I'm close to just ordering a couple for myself, it's working so well.)
Good luck .. (Sometimes I feel like I'm the crazy guy in the woods loving this model so much, I'm not sure why more people aren't jumping on it..)
Was it too good or not good enough? (blink twice if you can't answer lol)
(I have a DGX Spark, and an MBP w/ 128GB.)
(I've mentioned this before but AIUI it would require some new feature definitions in GGUF, to allow for coalescing model data about any one expert-layer into a single extent, so that it can be accessed in bulk. That's what seems to make the new Flash-MoE work so well.)
Also, as I understand it the 26B is the MOE and the 31B is dense - why is the larger one dense and the smaller one MOE?
Isn't that more dictated by the competition you're facing from Llama and Qwen?
I personally strive to build software and models that provide the best and most usable experience for lots of people. I did this before I joined Google, with open source and my writing on "old school" generative models, and I'm lucky that I get to do this at Google in the current LLM era.