It also means that if they actually trained with vision, they'd be on par with Anthropic models as vision seems to improve model performance across the board even for non-vision tasks.
With open weights LLMs, it is affordable to use many different models, each for whatever it is better.
Moreover, for analyzing "UIs, photos, screenshots, etc." there are small models that can be run locally on smartphones or laptops, e.g. IBM granite-vision-4.1-4B, certain Google Gemma 4 variants and certain Qwen variants, whose output you can use as input for a big LLM, in order to accomplish some more complex task.