One thing I am curious about is a hybrid approach where LLMs work in conjunction with vision models (and probes which can query/manipulate the DOM) to generate Playwright code which wraps browser access to the site in a local, programmable API. Then you'd have agents use that API to access the site rather than going through the vision agents for everything.
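To make that concrete, here is a rough sketch of what one generated wrapper could look like. Everything specific in it is an assumption for illustration: the `WikiSite` class, the `/search` path, and the "Search" accessible names are hypothetical, and `page` is only assumed to behave like a Playwright sync `Page` (it uses `goto`, `get_by_role`, `locator`, `fill`, `click` and `all_inner_texts`).

```python
from typing import Any, List


class WikiSite:
    """Hypothetical local API that hides browser automation behind plain methods."""

    def __init__(self, page: Any, base_url: str) -> None:
        # `page` is assumed to behave like a Playwright sync Page object.
        self._page = page
        self._base_url = base_url.rstrip("/")

    def search(self, query: str) -> List[str]:
        """Run a site search and return the visible result titles."""
        # The /search path and the accessible names below are made up.
        self._page.goto(f"{self._base_url}/search")
        # Role-based locators lean on the accessibility tree rather than CSS.
        self._page.get_by_role("searchbox", name="Search").fill(query)
        self._page.get_by_role("button", name="Search").click()
        # Collect the text of every link inside the main content area.
        return self._page.locator("main").get_by_role("link").all_inner_texts()
```

With a real browser, the `page` would come from Playwright's `sync_playwright()` and `browser.new_page()`. An agent would then call `site.search("...")` and never touch the DOM or a screenshot for that task.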
https://playwright.dev/docs/getting-started-mcp#accessibilit...
I've mentioned several times that rewriting your code so it fits in your head, and in the LLM's context, helps the LLM code better, and I've gotten snarky remarks for it. People complain about rewriting code just for an LLM, not realizing that the suggestion is to follow better coding principles, which has the net benefit of letting humans code better too! Well, it looks like if you support accessibility in your web apps correctly, Playwright MCP will work correctly for you.
Amazing.
Harder to scale if it's doing a lot of them, I suppose.
Most wikis you can mirror locally if you really need to hammer them.
and now the fact that interfaces need to be accessible to agents, not just humans, ironically increases accessibility for humans in return
I think this is very fertile ground - big labs need to use approaches that can work on multiple platforms and arbitrary workflows, and full-page vision is the lowest common denominator. Platform-specific approaches are a really exciting open space!
https://accessibilityinsights.io/
https://learn.microsoft.com/en-us/windows/win32/winauto/insp...
https://github.com/FlaUI/FlaUInspect
and for WPF applications specifically,
i so far haven't found any application that doesn't.
all you're able to get out, as far as i can tell, is the length of the entered password.
https://devblogs.microsoft.com/cppblog/spy-internals/
Obviously, if you can inject code into a process that receives sensitive data, you're already running in a context where all security bets are off.
But with processes you yourself create, you probably can, even without elevated privileges, unless the application takes measures to prevent injection (akin to game anticheat mechanisms), so it seems worth pointing out that there are simple mechanisms to subvert such "protected" fields that don't require application-specific reverse engineering.
Now the argument against this on [reddit](https://www.reddit.com/r/openclaw/comments/1s1dzxq/comment/o...):
"my experience is the opposite actually. UIA looks uniform on paper but WPF, WinForms, and Win32 all expose different control patterns and you end up writing per-toolkit handlers anyway. Qt only exposes anything if QAccessible was compiled in and the accessibility plugin is loaded at runtime, which on shipped binaries is basically never. Electron is just as opaque on Windows as on macOS because it's the same chromium underneath drawing into a canvas. the real split isn't OS vs OS, it's native toolkit vs everything else."
i tend to think of invoke as "an API over macOS apps" tho...
doesn't `invoke finder shareAndCopyLink` read very nicely? :P
in the context of this blog post, the conclusion looks similar though!
"use the whole web like it's an API"
works much better than
"figure out similar or identical tasks from a clean slate every single time you do them"
invoke overlaps more with Claude's and Codex's computer-use, except the steps are stored/scripted.
webmcp is bottom-up. computer-use & invoke are top-down
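The "stored/scripted" part can be sketched roughly like this. It is not how invoke actually works. It is just a minimal illustration of the idea that the expensive exploration runs once per task and the recorded steps get replayed afterwards. `ScriptStore`, the `Step` shape, and the `explore` callback are all hypothetical names.

```python
import json
from pathlib import Path
from typing import Callable, Dict, List

# Hypothetical step shape, e.g. {"action": "click", "target": "Share"}.
Step = Dict[str, str]


class ScriptStore:
    """Caches the recorded steps for a named task in a JSON file."""

    def __init__(self, path: Path) -> None:
        self._path = path
        # Load previously recorded scripts if the file already exists.
        self._scripts: Dict[str, List[Step]] = (
            json.loads(path.read_text()) if path.exists() else {}
        )

    def steps_for(self, task: str, explore: Callable[[str], List[Step]]) -> List[Step]:
        """Return stored steps, calling `explore` only on the first request."""
        if task not in self._scripts:
            # Slow path: stand-in for a vision/LLM agent working from a clean slate.
            self._scripts[task] = explore(task)
            self._path.write_text(json.dumps(self._scripts, indent=2))
        return self._scripts[task]
```

The second time someone asks for the same task, `explore` is never called, which is the whole difference between "use it like an API" and "figure it out from a clean slate every time".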