Presumably you're running Qwen 3.5 0.8B unquantized, because it's so small. For Gemma 4 12B, get at least the Q8 quant with the F32 mmproj, and use an f16 KV cache.

Then run it with the latest llama.cpp that contains the Gemma 4 12B unified bug fixes, using --image-min-tokens 560 --image-max-tokens 2240 --batch-size 4096 --ubatch-size 4096 --temp 1.0 --top-p 0.95 --top-k 64 --jinja
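
For anyone who wants to script this, here is a minimal sketch that assembles the flags above into one command line. A few things in it are assumptions and not part of the original advice: the binary name (llama-server), the two .gguf file names (hypothetical placeholders), and the use of --cache-type-k / --cache-type-v to request the f16 KV cache. Check the flag names against --help on your own llama.cpp build before relying on it.

```python
import shlex


def build_llama_command(model_path, mmproj_path, binary="llama-server"):
    """Return the llama.cpp invocation as an argument list.

    The binary name and both file paths are supplied by the caller;
    the defaults used in the demo below are placeholders, not real files.
    """
    return [
        binary,
        "--model", model_path,          # Q8 (or better) quant of the model
        "--mmproj", mmproj_path,        # F32 multimodal projector
        "--cache-type-k", "f16",        # assumed flags for an f16 KV cache
        "--cache-type-v", "f16",
        "--image-min-tokens", "560",    # values copied from the comment above
        "--image-max-tokens", "2240",
        "--batch-size", "4096",
        "--ubatch-size", "4096",
        "--temp", "1.0",
        "--top-p", "0.95",
        "--top-k", "64",
        "--jinja",
    ]


if __name__ == "__main__":
    # Hypothetical file names; substitute the files you actually downloaded.
    cmd = build_llama_command("gemma-4-12b-Q8_0.gguf", "mmproj-F32.gguf")
    # Print a shell-safe command string that can be pasted into a terminal.
    print(shlex.join(cmd))
```

Keeping the arguments in a list makes it easy to pass them straight to subprocess.run, or to change one value at a time while hunting for the setting that breaks image input.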

With that setup it understands far more complex images for me and reliably handles tiny text, so it should easily understand an image that contains only the text "This is a test".

That sounds like a bug. Bugs are very common in open model releases on the first day. If I weren't on mobile, I'd try it on Google's own app.
Test it on a professional inference provider to rule out trouble on your end.
Sounds like you're doing it wrong, to be honest.
I guess Google implements more, or stronger, guardrails than Alibaba, and that confuses these small models. At least that was my impression with the Gemma 3 models, which often claimed that an image contained nudity or sex scenes and refused to describe it. I never understood the point of this behavior.
The biggest problem with all the Google models has always been RLHF, particularly safety training. They take a good, smart model and make it behave like a corporate employee who has been to far too many forced anti-{sexism, racism...} seminars and now lives in fear of saying something that could be construed as wrong by some moral standard.
This is almost certainly not true.

If it were, they wouldn't need the classifiers they use to warn Gemini about problematic prompts.

I've always found the Gemma models to vastly underperform Qwen on vision tasks, so that's nothing new.
The Qwen series adopted vision wayyy earlier than anyone else. No idea why the other labs were sleeping on it, but Qwen had about 2 years of experimentation without any competition.