If you don't have the use of your hands, you want that. The whole point of accessibility APIs is allowing arbitrary control of your computer via novel means. One of the big selling points of Dragon NaturallySpeaking is the ability to tell your computer to do things based on descriptions, without a mouse: "open outlook", "click compose", "select subject", "type foo", etc. Unfortunately, modern software breaks this a lot. Chrome and anything Electron-based don't provide any accessibility information to the OS. The interior of the window, excluding the tab bar, is a void. Yes, Chrome has an inbuilt screen reader, as do a number of Electron apps. But if you aren't blind and want to use something like Dragon, it doesn't work. Canvas-based apps are often the same.
And no, the solution here is not computer vision with an LLM. Text and buttons rendered on my computer exist in memory somewhere as text and buttons. We should not need to convert them to pixels and then lossily back again to recover text and buttons. We should just expose things to the accessibility API and not guess.
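To make that concrete, here is a toy sketch of why an exposed tree matters. This is not any real accessibility API: `Node`, `find`, and both example trees are made up for illustration. It only shows that a command like "click compose" is a plain lookup by role and name when the app publishes its widgets, and has nothing to match against when the window contents are one opaque pane.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class Node:
    """Hypothetical accessibility-tree node: a role, a name, and children."""
    role: str
    name: str = ""
    children: List["Node"] = field(default_factory=list)

    def walk(self) -> Iterator["Node"]:
        # Depth-first traversal of this node and everything under it.
        yield self
        for child in self.children:
            yield from child.walk()


def find(root: Node, role: str, name: str) -> Optional[Node]:
    """Return the first node with the given role and (case-insensitive) name."""
    for node in root.walk():
        if node.role == role and node.name.lower() == name.lower():
            return node
    return None


# An app that exposes its widgets: "click compose" resolves to a real node.
exposed = Node("window", "Mail", [
    Node("button", "Compose"),
    Node("textbox", "Subject"),
])

# An app whose content area is a void: there is nothing to look up.
opaque = Node("window", "Mail", [Node("pane")])

print(find(exposed, "button", "compose") is not None)
print(find(opaque, "button", "compose") is not None)
```

No guessing is involved in the first case, and no amount of cleverness in the lookup helps in the second.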
Are we sure about this? At least on Windows, NVDA works fine with Chrome and with Electron apps.
Also, even if you hypothetically wanted to use computer vision with an LLM… what API is that LLM going to use to take screenshots and click on stuff?
I want apps to be able to do that!
Is there an opinionated reason not to break out capabilities?
If you have a disability and need tools to use your computer, the last thing you want is for those tools to be not only off by default but also complicated and involved to turn on.