upvote
That was my first thought as well. A lot of current web development relies heavily on code generation, with obfuscation and compression slapped on top, leading to complicated structures. Then, on top of that, more code (client-side JavaScript) reconfigures everything again. You end up with fairly complicated HTML/CSS/JavaScript to wade through.

For better or worse, 5-10 MiB isn't uncommon for a web app.

Instead of trying to go "bottom up" and, effectively, do what a browser engine does in reverse, it seems easier to go "top down" like a human does and work from the visual representation.

reply
I think you're right: you can get agents to do what we do -- learn how a website works -- then expose that model as a simple API. There will still be some vision tasks for navigation, but they'll be pure vision tasks, no thinking required.
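A minimal sketch of what I mean, with everything here hypothetical (the class name, the selector/operation pairs, the `teach`/`plan` methods are all made up for illustration): the vision agent explores the site once, records the concrete steps for each high-level action, and after that callers just replay the cached steps -- no model inference per call.

```python
# Hypothetical sketch: a vision agent "learns" a site once, caching the
# concrete steps (selector + operation) for each high-level action.
# After learning, the site is driven through this plain API -- no vision
# or LLM call needed on the hot path.
from dataclasses import dataclass, field


@dataclass
class LearnedSiteModel:
    """Maps high-level action names to the ordered steps the agent found."""
    actions: dict = field(default_factory=dict)

    def teach(self, name, steps):
        # steps: ordered (css_selector, operation) pairs discovered visually
        self.actions[name] = list(steps)

    def plan(self, name):
        # Deterministic replay of the cached steps; raises KeyError if the
        # agent never learned this action.
        return self.actions[name]


# One expensive visual exploration pass populates the model...
model = LearnedSiteModel()
model.teach("search", [("input#q", "type"), ("button.submit", "click")])

# ...and every later call is a cheap lookup.
print(model.plan("search"))
```

The cache would of course go stale when the site's markup changes, at which point the agent re-learns that one action; that's the trade-off versus doing vision on every interaction.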
reply
> Is it possible to ask the vision agent to "map"

No, most vision models focus on a subset of an image at a time when doing image -> text.

Image -> image uses the whole image.

reply
> No most vision models focus on subset of an image at a time when using image -> text

Is this true? Where can I read more about it?

reply