You can run Q3.5-35B-A3B at ~100 tok/s.
I tried G4 26B A4B as a drop-in replacement for Q3.5-35B-A3B in some custom agents, and G4 doesn't respect the prompt rules at all. I added <|think|> in the system prompt as described, but I haven't spent time checking whether reasoning was effectively on. I'll need to investigate further, but it doesn't seem promising.
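For context, this is roughly how I pass the tag: a minimal sketch of the request I send to a local llama.cpp server's OpenAI-compatible endpoint. The model name, tag placement, and prompt text are my own assumptions, not anything verified against Gemma 4's docs.

```python
import json

# Hypothetical sketch: put <|think|> in the system prompt and send the chat
# to llama-server's OpenAI-compatible /v1/chat/completions endpoint.
# Tag placement and model name are assumptions, not verified.
payload = {
    "model": "gemma4-26b-a4b",  # placeholder name
    "messages": [
        {"role": "system",
         "content": "<|think|>\nYou are an agent. Follow the tool-calling rules strictly."},
        {"role": "user", "content": "List the files in the repo."},
    ],
}
body = json.dumps(payload)
# e.g. POST http://localhost:8080/v1/chat/completions with this body
```

Whether the model actually enters reasoning mode with this is exactly what I still need to check.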
I also tried G4 26B A4B with images in the webui, and it works quite well.
I have not yet tried the smaller models with audio.
That's what I meant by "waiting a few days for updates" in my other comment. At the Qwen 3.5 release, I remember a lot of complaints along the lines of "tool calling isn't working properly", etc.
That was fixed shortly after: there was some template-parsing work in llama.cpp, and Unsloth pulled some models and re-uploaded improved ones, for better quantization or something else I can't quite remember...
coder543 pointed out that the same is happening with tool calling on Gemma 4: https://news.ycombinator.com/item?id=47619261
I'll try again in a few days. It's great to be able to test it just a few hours after the release, but it's the bleeding edge: I had to pull the latest from main. And with all the supply-chain issues happening everywhere, the bleeding edge is always riskier from a security point of view.
There is also always the possibility of fine-tuning the model later to make sure it can complete the custom task correctly, but the code for doing LoRA on Gemma 4 is probably not yet available. The 50% extra speed seems really tempting.
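Part of why LoRA is appealing for this kind of task-specific fix is how few parameters it touches. A back-of-the-envelope sketch in pure Python, with illustrative dimensions that have nothing Gemma-specific about them:

```python
# Rough parameter-count comparison: full fine-tuning updates every entry of a
# d_out x d_in weight matrix, while LoRA learns a rank-r update B @ A, with
# B of shape (d_out, r) and A of shape (r, d_in). Numbers are illustrative.
def full_params(d_out, d_in):
    return d_out * d_in

def lora_params(d_out, d_in, r):
    return d_out * r + r * d_in

d_out = d_in = 4096   # a plausible projection size
r = 16                # a commonly used LoRA rank
print(full_params(d_out, d_in))     # 16777216 weights to update
print(lora_params(d_out, d_in, r))  # 131072 weights, ~128x fewer
```

That gap is what makes "fine-tune it later on the custom task" a realistic fallback once the training code lands.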