undefined

points

[-]

So there’s a very big difference in the sort of vision approach that browser-use does vs. what we do

browser-use is still strongly coupled to the DOM for interaction because of the set-of-marks approach it uses (for context - those little rainbow boxes you see around the elements). This means it’s very difficult to get it to reliably do interactions outside of straightforward click/type like drag and drop, interacting with canvas, etc.

Since we interact based purely on what we see on the screen using pixel coordinates, those sort of interactions are a lot more natural to us and perform much more reliably. If you don't believe me, I encourage you to try to get both Magnitude and browser-use to drag and drop cards on a Kanban board :)

Regardless, best of luck!