What's being delivered now is an agent running on someone else's computer, copying your data to someone else's database, with zero responsibility or mandate to protect that data and not share it with anyone else (in fact, they almost always promise to share it with their thousand partners); offering suggestions and preferences based on someone else's so-called recommendations, influenced by whoever pays the agent's operators; and increasing pressure to make someone else's computers + agents the only way to interact with other people and systems.
There is no doubt that LLMs can do amazing things, but the current environment seems to make it nearly impossible to do anything with them that doesn't let someone else inspect, influence, and even restrict everything you are doing with these systems.
If we're going to have AI regulation, this is where to start: if a company's AI service acts for a user, the company has non-disclaimable financial responsibility for anything that goes wrong. There's an area of law called "agency", which covers the liability of an employer for the actions of its employees. The law of agency should apply to AI agents. One court has already done that: an airline's AI gave wrong but reasonable-sounding advice on fares, a customer made a decision based on that advice, and the court held that the AI's advice was binding on the company, even though it cost the company money.
This is something lawyers and politicians can understand, because there's settled law on this for human agents.
I guess what I'm saying is - we've always had this problem.
But I don't think the voice problem is surmountable. I closed their image editing demo when I saw it required a mic.
It would be appealing as a Spotlight-like text pop-up interface where you type instructions, which would work in social/office environments, but that might only appeal to power users.
But if it's going to require phoning home to some Google/OpenAI/whoever then forget it. I don't want a constant connection to my OS from one of these companies.
Except for the large majority of people who read, type, and click way faster than they can talk. Especially for visual things it’s way faster to drag a rectangle than to describe what you want.
A lot of us also aren’t linear verbal thinkers. It would take minutes to hours to verbalize concepts we can grasp visually/schematically in seconds.
Great book on the topic: https://www.goodreads.com/book/show/60149558-visual-thinking
I usually convey the same meaning with 80wpm typing. Makes it faster to read too
Maybe I’m just slightly ADHD – listening to people talk drives me crazy. Get to the point! Much easier if they type it out.
People have so many verbal tics and filler words too. Anthropic’s Dario says “you know” after every third word, for example.
Or they meander around unrelated/unimportant details.
Neither typing speed nor dictation speed is a true bottleneck, but editing speech seems like it'd be harder than editing text.
Though there may be some hybrid approach that can work well.
I hadn’t realized until just now how accurate that is for me as well. Thank you.
I recommend the youtube channel @afadingthought to see what people come up with (like v=283-z29TXeM).
It's like a hidden curse of LLMs -- they're so good at parsing intended meaning from non-grammatically-correct language that we don't have to be very good at clear communication.
Eventually all LLMs will be controlled by humans uttering terse guttural grunts. We will all become Neanderthals, with machines that deliver our every whim.
I dunno how I can express this best, but I found out a very long time ago that my problem with voice input wasn't that it wasn't good enough. My problem with voice input is that I don't want it. I am very happy for people who use these tools that they exist. I will not be them. Yes I am sure.
And yes, I know SuperWhisper can run offline, but it is a notable benefit that, unlike many modern speech recognition tools, my keyboard does not require an always-active Internet connection, a subscription payment, or several teraflops of compute power.
I am not a flat-out luddite. I do use LLMs in some capacity, for whatever it is worth. Ethical issues or not, they are useful and probably here to stay. But my God, there are so many ways in which I am very happy to be "left behind".
I think it's brilliant UX.
First things that came to mind:
- facial hair
- getting people to learn to make bigger mouth movements and not mumble
- we're constantly self-correcting our speech as we hear our voice. This removes the feedback loop.
- non-English languages (God forbid bilingualism)
- camera angles and head movement
And that's just thinking about it for 30 seconds. I'm sure there are some really good use cases, but will any research group/company push through for years and years to make it really good even if the response is lukewarm?

In my experience, any combination of computers + speech + Danish has, so far without exception, been terrible. Last time I tested ChatGPT, it couldn't understand me at all. I spoke both in my local dialect and as close to Rigsdansk as I could manage. Unusable performance, and in any case I should be able to talk normally, or there's no point. It was about a year ago - it may have improved, but I doubt it. I'm completely done trying to talk to machines.
Pre-emptive kamelåså: https://www.youtube.com/watch?v=s-mOy8VUEBk
It's a cool idea for the future when we have reliable EEG headsets or Neuralink or whatever though.
Siri's voice transcription is pretty awful compared to what I've experienced with ChatGPT, though, and it's weird going back to an almost pre-LLM world where you have to give such clear, computer-coded voice commands.
Reads like the argument against cell phones when you don't have a cabinet around you...
I'd go and find a small meeting room or conference call booth in the office and take it there.
Essentially, a cabinet.
In fact, when humans give orders to other humans, it's typically done in writing.
https://www.youtube.com/watch?v=46EopD_2K_4
>We present a general-purpose implementation of Grossman and Balakrishnan's Bubble Cursor [broken link], the fastest general pointing facilitation technique in the literature. Our implementation functions with any application on the Windows 7 desktop. Our implementation functions across this infinite range of applications by analyzing pixels and by leveraging human corrections when it fails.
Transcript:
>We present the general-purpose implementation of the bubble cursor. The bubble cursor is an area cursor that expands to ensure that the nearest target is always selected. Our implementation functions on the Windows 7 desktop and any application for that platform. The bubble cursor was invented in 2005 by Grossman and Balakrishnan. However, a general-purpose implementation of this cursor, one that works with any application on a desktop, has not been deployed or evaluated. In fact, the bubble cursor is representative of a large body of target-aware techniques that remain difficult to deploy in practice. This is because techniques like the bubble cursor require knowledge of the locations and sizes of targets in an interface. [...]
https://www.dgp.toronto.edu/~ravin/papers/chi2005_bubblecurs...
>The Bubble Cursor: Enhancing Target Acquisition by Dynamic Resizing of the Cursor’s Activation Area
>Tovi Grossman, Ravin Balakrishnan; Department of Computer Science; University of Toronto
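To make the quoted idea concrete, here's a minimal Python sketch of the bubble cursor's selection rule (my own illustration under simplifying assumptions, not the paper's or Dixon's code; it assumes targets are already known as circles, which is exactly the metadata Prefab recovers by analyzing pixels):

    # Hypothetical sketch of the bubble-cursor selection rule: the cursor's
    # activation "bubble" effectively grows until it contains the nearest
    # target, so some target is always selected wherever the pointer sits.
    from dataclasses import dataclass
    import math

    @dataclass
    class Target:
        x: float       # center x
        y: float       # center y
        radius: float  # effective target size

    def bubble_select(cx: float, cy: float, targets: list[Target]) -> Target:
        """Pick the target whose *edge* is closest to the cursor (cx, cy)."""
        # Distance to the boundary rather than the center, so larger
        # targets remain proportionally easier to acquire.
        return min(targets, key=lambda t: math.hypot(t.x - cx, t.y - cy) - t.radius)

    targets = [Target(10, 10, 4), Target(40, 12, 2), Target(25, 30, 6)]
    print(bubble_select(20, 18, targets))  # -> Target(x=25, y=30, radius=6)

As I understand the 2005 paper, the real technique additionally caps and morphs the bubble so it never overlaps a second target, but the nearest-edge rule above is the core of it.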
I've written more about Morgan Dixon's work on Prefab (pre-LLM pattern recognition, which is much more relevant now with LLMs).
https://news.ycombinator.com/item?id=11520967
https://news.ycombinator.com/item?id=14182061
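For flavor, a toy sketch (my own construction, not Prefab's actual method) of the pixel-analysis idea the abstract mentions: locate a widget in a screenshot by exact template matching, with no help from the application. Prefab goes far beyond this with learned patch models and human corrections; the file names here are hypothetical.

    # Toy pixel-based widget detection: slide a known widget image over a
    # screenshot and return where it matches exactly. Slow (O(n^2) crops)
    # but runnable; assumes both images are in the same pixel mode.
    from PIL import Image

    def find_widget(screenshot, template):
        """Return the top-left (x, y) of an exact pixel match, or None."""
        sw, sh = screenshot.size
        tw, th = template.size
        wanted = template.tobytes()
        for y in range(sh - th + 1):
            for x in range(sw - tw + 1):
                if screenshot.crop((x, y, x + tw, y + th)).tobytes() == wanted:
                    return (x, y)
        return None  # a real system would fall back to human correction here

    shot = Image.open("screenshot.png").convert("RGB")      # hypothetical file
    button = Image.open("ok_button.png").convert("RGB")     # hypothetical file
    print(find_widget(shot, button))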
https://www.media.mit.edu/publications/put-that-there-voice-...
(And if it's an abstract entity like a file, it might not even be possible to describe it, short of rattling off the entire file path)