For better and worse, 5-10 MiB isn't uncommon for a web app.
Instead of trying to go "bottom up" and, effectively, do what a browser engine does in reverse, it seems easier to go "top down" like a human does and work from the visual representation.
No, most vision models focus on a subset of an image at a time when doing image -> text.
image -> image uses the whole image.
Is this true? Where can I read more about it?