Configure a subagent in your coding harness to spin up a new sub-session with any vision model for those tasks and feed the result back to the main model. There's no need for "one model that does everything".
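Roughly what I mean, as a minimal sketch rather than any particular harness's API. Everything here is hypothetical: `MainSession`, `run_vision_subagent`, `delegate`, and the `vision_fn` callable are stand-ins for whatever your harness and vision model actually expose.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical type: any function that takes image bytes plus a prompt
# and returns text. Wrap whichever vision model you like behind it.
VisionFn = Callable[[bytes, str], str]


@dataclass
class MainSession:
    """Stand-in for the main (text-only) model's conversation context."""
    messages: list = field(default_factory=list)

    def add_tool_result(self, name: str, content: str) -> None:
        # The main model only ever sees text, never the image itself.
        self.messages.append({"role": "tool", "name": name, "content": content})


def run_vision_subagent(image: bytes, question: str, vision_fn: VisionFn) -> str:
    """Run an isolated sub-session: only the image and the question go in."""
    prompt = (
        "Answer concisely for another model that cannot see the image.\n"
        + question
    )
    return vision_fn(image, prompt).strip()


def delegate(session: MainSession, image: bytes, question: str,
             vision_fn: VisionFn) -> str:
    """Hand the image task to the subagent and feed the text result back."""
    answer = run_vision_subagent(image, question, vision_fn)
    session.add_tool_result("vision_subagent", answer)
    return answer
```

The point of the design is that the sub-session starts with a clean context, so the main model's window only grows by the short text answer and not by the image tokens.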
reply
Are you suggesting it should summarize the image in text or generate it in HTML or something else?
reply
I've been using Google AI Studio as a free vision bridge. Gemma 31B is very capable at vision, and at 1,500 requests per day it's basically unlimited.
reply
I don't see this as such a big gap. There are some use cases for sure, but apart from UX/UI work it isn't really needed. Besides, none of the frontier models can replicate actual images; they can only approximate them, at least in my own experience.
reply
One of my tests for a new model is dumping in a screenshot of a web page and seeing if it can recreate it from scratch in HTML and CSS.

Even the local models I run on my Mac are getting surprisingly good at that now.

reply
I'm using LLMs to generate .docx files. Being able to rasterize and review the output is an important part of the process.
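For anyone wondering what the rasterize step can look like, here is one possible sketch, not necessarily the pipeline described above. It assumes LibreOffice (`soffice`) and Poppler's `pdftoppm` are installed and on PATH; the function names `build_commands` and `rasterize` are made up for illustration.

```python
import shutil
import subprocess
from pathlib import Path


def build_commands(docx: Path, out_dir: Path, dpi: int = 100) -> list:
    """Return the two command lines: docx -> pdf, then pdf -> one PNG per page."""
    pdf = out_dir / (docx.stem + ".pdf")
    return [
        # LibreOffice headless conversion writes <stem>.pdf into out_dir.
        ["soffice", "--headless", "--convert-to", "pdf",
         "--outdir", str(out_dir), str(docx)],
        # pdftoppm writes files named with the "page" prefix, one per page.
        ["pdftoppm", "-png", "-r", str(dpi), str(pdf), str(out_dir / "page")],
    ]


def rasterize(docx: Path, out_dir: Path, dpi: int = 100) -> list:
    """Render a .docx to page images that a vision model can review."""
    for tool in ("soffice", "pdftoppm"):
        if shutil.which(tool) is None:
            raise RuntimeError(f"{tool} not found on PATH")
    out_dir.mkdir(parents=True, exist_ok=True)
    for cmd in build_commands(docx, out_dir, dpi):
        subprocess.run(cmd, check=True)
    return sorted(out_dir.glob("page*.png"))
```

The resulting PNGs can then be handed to whichever vision model does the review, and its comments fed back to the model that generated the document.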
reply
I had the same reaction with DeepSeek V4! It would be more useful as a vision model.
reply