Moreover, tool invocation had problems that were later corrected by Google in an updated chat template.
Any early benchmarks that showed the dense model as inferior to the MoE model are therefore likely flawed, and they should be repeated after updating both the inference backend and the model.
Every benchmark I have seen since the bugs were fixed shows the dense model as clearly superior in quality, albeit much slower.
Google did a similar re-release during the Gemini 3.1 Pro Preview rollout: a custom-tools variant with its own slug, which performs MUCH better on custom harnesses (mostly because the original release could not produce tool call formatting correctly at all).