undefined

points

[-]

DeepSeekv4+ will have image capability, they said so in their paper. GLM whenever they decide to. Both companies have they tech and for whatever reason haven't decide to prioritize it. Both of their OCR are SOTA among all OCR models closed or open. GLM demonstrated they know how to do this, with GLM-4.6V.

by dryarzeg7 hours ago|

prev|

[-]

Yes, you are right (as far as I'm aware). For things where you need the LLM to look at screenshots, photos or other images you can use Kimi-K2.6/K2.7 - comparable pricing, somewhat comparable performance and quality. You can even probably combine two models (e.g Kimi and GLM) in one agent, using Kimi for multimodal inputs and GLM for everything else, although 1) I'm not sure if this will not cause some kind of context poisoning with low-quality patterns for better performing model (e.g. in some cases Kimi may be worse than GLM, but GLM, when following up, may adopt the same reasoning patterns as Kimi, undermining it's own performance), and 2) I'm not quite sure if it's possible with the tools currently available (I'm not really into agentic or chatbots stuff to be honest).

by mordae8 hours ago|

prev|

[-]

They do not and it sucks for certain tasks.

It also means that if they actually trained with vision, they'd be on par with Anthropic models as vision seems to improve model performance across the board even for non-vision tasks.

by osti7 hours ago|

parent|

[-]

Many other open source models have vision but they don't compare to GLM in terms of coding quality. So I don't think it's because of vision that the frontier models are better, it's more that they are probably just much bigger models.

by freigeist794 hours ago|

parent|

prev|

[-]

it helps giving them a cli vision tool (curl to openrouter vision model for example)

by adrian_b8 hours ago|

prev|

[-]

That's right, but there are other recent open weights and relatively big LLMs that are multimodal, e.g. MiniMax-M3.

With open weights LLMs, it is affordable to use many different models, each for whatever it is better.

Moreover, for analyzing "UIs, photos, screenshots, etc." there are small models that can be run locally on smartphones or laptops, e.g. IBM granite-vision-4.1-4B, certain Google Gemma 4 variants and certain Qwen variants, whose output you can use as input for a big LLM, in order to accomplish some more complex task.

by 0xbadcafebee5 hours ago|

prev|

[-]

Configure a subagent in your coding harness for vision, add a prompt about the vision use, configure a vision model for it, modify your main agent's prompt to use the vision subagent for vision tasks. Now your non-vision model has vision support.

by Havoc7 hours ago|

prev|

[-]

They have a separate VL model but never tried it