upvote
We are always figuring out what parameter size makes sense.

The decision is always a mix between how good we can make the models from a technical aspect, with how good they need to be to make all of you super excited to use them. And its a bit of a challenge what is an ever changing ecosystem.

I'm personally curious is there a certain parameter size you're looking for?

reply
For the many DGX Spark and Strix Halo users with 128GB of memory, I believe the ideal model size would probably be a MoE with close to 200B total parameters and a low active count of 3B to 10B.

I would personally love to see a super sparse 200B A3B model, just to see what is possible. These machines don't have a lot of bandwidth, so a low active count is essential to getting good speed, and a high total parameter count gives the model greater capability and knowledge.

It would also be essential to have the Q4 QAT, of course. Then the 200B model weights would take up ~100GB of memory, not including the context.

The common 120B size these days leaves a lot of unused memory on the table on these machines.

I would also like the larger models to support audio input, not just the E2B/E4B models. And audio output would be great too!

reply
Following the current rule of thumb MoE = `sqrt(param*active)` a 200B-A3B would have the intelligence of a ~24B dense model.

That seems pointless. You can achieve that with a single 24G graphics card already.

I wonder if it would even hold up at that level, as 3B active is really not a lot to work with. Qwen 3.5 uses 122B-A10B and still is neck and neck with the 27B dense model.

I don't see any value proposition for these little boxes like DGX Spark and Strix Halo. Lots of too-slow RAM to do anything useful except run mergekit. imo you'd have been better building a desktop computer with two 3090s.

reply
Large MoE models are too heavily bottlenecked on typical discrete GPUs. You end up pushing just a few common/non-shared layers to GPU and running the MoE part on CPU, because the bandwidth of PCIe transfers to a discrete GPU is a killer bottleneck. Platforms with reasonable amounts of unified memory are more balanced despite the lower VRAM bandwidth, and can more easily run even larger models by streaming inactive weights from SSD (though this quickly becomes overkill as you get increasingly bottlenecked by storage bandwidth: you'd be better off then with a plain HEDT accessing lots of fast storage in parallel via abundant PCIe lanes).
reply
That rule of thumb was invented years ago, and I don’t think it is relevant anymore, despite how frequently it is quoted on Reddit. It is certainly not the "current" rule of thumb.

For the sake of argument, even if we take that old rule of thumb at face value, you can see how the MoE still wins:

- (DGX Spark) 273GB/s of memory bandwidth with 3B active parameters at Q4 = 273 / 1.5 = 182 tokens per second as the theoretical maximum.

- (RTX 3090) 936GB/s with 24B parameters at Q4 = 936 / 12 = 78 tokens per second. Or 39 tokens per second if you wanted to run at Q8.

The "slow" DGX Spark is now more than twice as fast as the RTX 3090, thanks to an appropriate MoE architecture. Even with two RTX 3090s, you would still be slower. Yes, an RTX 5090 would close that gap significantly, but you mentioned RTX 3090s, and I also have an RTX 3090-based AI desktop.

(The above calculation is dramatically oversimplified, but the idea holds. Token generation is bandwidth limited.)

The mid-size frontier models are rumored to be extremely sparse like that, but 10x larger on both total and active. No one has ever released an open model that sparse for us to try out.

As I said, I wanted to see what is possible for Google to achieve.

> Qwen 3.5 uses 122B-A10B and still is neck and neck with the 27B dense model.

From what I've seen, having used both, I would anecdotally report that the 122B model is better in ways that aren't reflected in benchmarks, with more inherent knowledge and more adaptability. This is arguably why MoEs are so popular in the first place. Otherwise the AI labs would all be making a smaller dense model of the same "intelligence". No one believes any of the frontier models are dense models.

reply
Kimi 2.5 is relatively sparse at 1T/32B; GLM 5 does 744B/40B so only slightly denser. Maybe you could try reducing active expert count on those to artificially increase sparsity, but I'm sure that would impact quality.
reply
200a10b please, 200a3b is too little active to have good intelligence IMO and 10b is still reasonably fast.
reply
120B would be great to have if you have it stashed away somewhere. GPT-OSS-120B still stands as one of the best (and fastest) open-weights models out there. A direct competitor in the same size range would be awesome. The closest recent release was Qwen3.5-122B-A10B.
reply
Nemotron 3 Super was released recently. That's a direct competitor to gpt-oss-120b. https://developer.nvidia.com/blog/introducing-nemotron-3-sup...
reply
In terms of ability, maybe, in terms of speed, it's not even close. Check out the Prompt Processing speeds between them: https://kyuz0.github.io/amd-strix-halo-toolboxes/

gpt-oss-120b is over 600 tokens/s PP for all but one backend.

nemotron-3-super is at best 260 tokens/s PP.

Comparing token generation, it's again like 50 tokens/sec vs 15 tokens/sec

That really bogs down agentic tooling. Something needs to be categorically better to justify halving output speed, not just playing in the margins.

reply
In my case with vLLM on dual RTX Pro 6000

gpt-oss-120b: (unknown prefill), ~175 tok/s generation. I don't remember the prefill speed but it certainly was below 10k

Nemotron-3-Super: 14070 tok/s prefill, ~194.5 tok/s generation. (Tested fresh after reload, no caching, I have a screenshot.)

Nemotron-3-Super using NVFP4 and speculative decoding via MTP 5 tokens at a time as mentioned in Nvidia cookbook: https://docs.nvidia.com/nemotron/nightly/usage-cookbook/Nemo...

reply
I gave it a whirl but was unenthused. I'll try it again, but so far have not really enjoyed any of the nvidia models, though they are best in class for execution speed.
reply
I'll pipe in here as someone working on an agentic harness project using mastra as the harness.

Nemotron3-super is, without question, my favorite model now for my agentic use cases. The closest model I would compare it to, in vibe and feel, is the Qwen family but this thing has an ability to hold attention through complicated (often noisy) agentic environments and I'm sometimes finding myself checking that i'm not on a frontier model.

I now just rent a Dual B6000 on a full-time basis for myself for all my stuff; this is the backbone of my "base" agentic workload, and I only step up to stronger models in rare situations in my pipelines.

The biggest thing with this model, I've found, is just making sure my environment is set up correctly; the temps and templates need to be exactly right. I've had hit-or-miss with OpenRouter. But running this model on a B6000 from Vast with a native NVFP4 model weight from Nvidia, it's really good. (2500 peak tokens/sec on that setup) batching. about 100/s 1-request, 250k context. :)

I can run on a single B6000 up to about 120k context reliably but really this thing SCREAMS on a dual-b6000. (I'm close to just ordering a couple for myself it's working so well).

Good luck .. (Sometimes I feel like I'm the crazy guy in the woods loving this model so much, I'm not sure why more people aren't jumping on it..)

reply
Jeff Dean apparently didn't get the message that you weren't releasing the 124B Moe :D

Was it too good or not good enough? (blink twice if you can't answer lol)

reply
Mainline consumer cards are 16GB, so everyone wants models they can run on their $400 GPU.
reply
Yea, I've been waiting a while for a model that is ~12-13GB so there is still a bit of extra headroom for all the different things running on the system that for some reason eat VRAM.
reply
My sweet spot is something that runs on less than 128gb.

(I have a DGX Spark, and MBP w/ 128gb).

reply
I'll pipe in - a series of Mac optimized MOEs which can stream experts just in time would be really amazing. And popular; I'm guessing in the next year we'll be able to run a very able openclaw with a stack like that. You'll get a lot of installs there. If I were a PM at Gemma, I'd release a stack for each Mac mini memory size.
reply
Expert streaming is something that has to be implemented by the inference engine/library, the model architecture itself has very little to do with it. It's a great idea (for local inference; it uses too much power at scale), but making it work really well is actually not that easy.

(I've mentioned this before but AIUI it would require some new feature definitions in GGUF, to allow for coalescing model data about any one expert-layer into a single extent, so that it can be accessed in bulk. That's what seems to make the new Flash-MoE work so well.)

reply
I’ve been doing some low-key testing on smaller models, and it looks to me like it’s possible to train an MOE model with characteristics that are helpful for streaming… For instance, you could add a loss function to penalize expert swapping both in a single forward, pass and across multiple forward passes. So I believe there is a place for thinking about this on the model training side.
reply
Penalizing expert swaps doesn't seem like it would help much, because experts vary by layer and are picked layer-wise. There's no guarantee that expert X in layer Y that was used for the previous token will still be available for this token's load from layer Y. The optimum would vary depending on how much memory you have at any given moment, and such. It's not obviously worth optimizing for.
reply
Something in the 60B to 80B range would still be approachable for most people running local models and also could give improved results over 31B.

Also, as I understand it the 26B is the MOE and the 31B is dense - why is the larger one dense and the smaller one MOE?

reply
how good they need to be to make all of you super excited to use them

Isn't that more dictated by the competition you're facing from Llama and Qwent?

reply
This is going to sound like a corp answer but I mean this genuinely as an individual engineer. Google is a leader in its field and that means we get to chart our own path and do what is best for research and for users.

I personally strive to build software and models provides provides the best and most usable experience for lots of people. I did this before I joined google with open source, and my writing on "old school" generative models, and I'm lucky that I get to this at Google in the current LLM era.

reply