The killer app was conceived as early as the 1980s: an agent running on your computer, organizing your files, your schedule, your messages, your bills, your bank accounts, and so on. All the routine drudgery of your life could be offloaded to a smart agent that, based on your preferences, brought you the information you needed via natural language queries, contextualized to what you were doing at the time, when you needed it.

What's being delivered now is an agent running on someone else's computer, copying your data to someone else's database, with zero responsibility or mandate to protect that data and not share it with anyone else (in fact, they almost always promise to share it with their thousand partners), offering suggestions and preferences based on someone else's so-called recommendations, influenced by whoever pays the agent's operators, and increasing pressure to make someone else's computers + agents the only way to interact with other people and systems.

There is no doubt that LLMs can do amazing things, but the current environment seems to make it nearly impossible to do anything with them that doesn't let someone else inspect, influence, and even restrict everything you are doing with these systems.

reply
> What's being delivered now is an agent running on someone else's computer, copying your data to someone else's database, with zero responsibility or mandate to protect that data and not share it with anyone else (in fact, they almost always promise to share it with their thousand partners), offering suggestions and preferences based on someone else's so-called recommendations, influenced by whoever pays the agent's operators, and increasing pressure to make someone else's computers + agents the only way to interact with other people and systems.

If we're going to have AI regulation, this is where to start. If a company's AI service acts for a user, the company has non-disclaimable financial responsibility for anything that goes wrong. There's an area of law called "agency", which covers the liability of an employer for the actions of its employees. The law of agency should apply to AI agents. One court already did that. An airline AI gave wrong but reasonable-sounding advice on fares, a customer made a decision based on that advice, and the court held that the AI's advice was binding on the company, even though it cost the company money.

This is something lawyers and politicians can understand, because there's settled law on this for human agents.

reply
A few decades back, a lot of computer use was emails. And it was stored on someone else's servers - with everyone from server operators along the route, to the government potentially having access to it. Even HTTPS is a relatively recent thing.

I guess what I'm saying is - we've always had this problem.

reply
Snail mail is also not secure and can be tampered with. I don’t mind someone hosting my mail. But I do mind Google doing it (based on their behavior).
reply
Yea there have always been gaps in privacy, but nowadays it's several orders of magnitude easier for corporations to exploit that private data at scale.
reply
The second half of your comment is a go-to-market concern but doesn't feel so relevant for a research prototype. It could be done with a private local model too, maybe not by Google.

But I don't think the voice problem is surmountable. I closed their image editing demo when I saw it required a mic.

It would be appealing as a Spotlight-like text pop-up interface where you type instructions, which would work in social/office environments, but that might only appeal to power users.

reply
This will sound like another brick in the paved road to dystopia but I'm kinda bullish on equipment that can recognize subvocalization. Or at least let me have a small drawing tablet with a stylus (think etch-a-sketch or Wacom Intuos) because at this point I'd rather practice writing and do away with typing altogether (even though I enjoy typing for typing's sake via MonkeyType).
reply
I've been dreaming about that for 20 years. And then use it for people to communicate while sleeping.
reply
Yeah I think there could be something to the integration of AI in an operating system so that it can handle things going on in different applications the same way you can already copy and paste between things.

But if it's going to require phoning home to some Google/OpenAI/whoever then forget it. I don't want a constant connection to my OS from one of these companies.

reply
It seems that if we ultimately want to "move at the speed of thought," it will require speech.
reply
> It seems that if we ultimately want to "move at the speed of thought," it will require speech.

Except for the large majority of people who read, type, and click way faster than they can talk. Especially for visual things it’s way faster to drag a rectangle than to describe what you want.

A lot of us also aren’t linear verbal thinkers. It would take minutes to hours to verbalize concepts we can grasp visually/schematically in seconds.

Great book on the topic: https://www.goodreads.com/book/show/60149558-visual-thinking

reply
Most people speak at about 150 wpm, but very few can type that fast. But reading and gesturing are fast, which is what TFA is about, combining reading and gesturing with speech.
reply
You rarely need 150wpm when typing. If you try dictation, you'll notice that half those words are error correction, checksum bits, and turn-taking filler.

I usually convey the same meaning with 80wpm typing. Makes it faster to read too

Maybe I’m just slightly adhd – listening to people talk drives me crazy. Get to the point! Much easier if they type it out

reply
> listening to people talk drives me crazy.

People have so many verbal tics and filler words too. Anthropic’s Dario says “you know” after every third word, for example.

Or they meander around unrelated/unimportant details.

reply
There's the adage that writing is thinking, but even more accurately at least for me, editing is thinking.

Neither typing speed nor dictation speed is a true bottleneck, but editing speech seems like it'd be harder than editing text.

Though there may be some hybrid approach that can work well.

reply
> editing is thinking.

I hadn’t realized until just now how accurate that is for me as well. Thank you.

reply
You should look into how often people are using tools like WisprFlow and SuperWhisper. Voice is a very native mechanism. Most people working in open floor plans are wearing headphones anyway. As long as you're not screaming, it's probably fine. Maybe we'll move away from open-plan offices in the bid for efficiency, which I would welcome.
reply
I am moving full remote because dictation is such a better input mechanism for most of my AI interactions that I have become less efficient sitting in my open floorplan desk at the office because I cannot dictate there and the latency adds up. Typing is just achingly slow these days.
reply
I feel like I can type faster than I can talk but I could be totally wrong?
reply
I also feel this way, but more importantly, I feel like my sentences are more coherent when typed because typing allows for corrections and modifications of ideas. Do whispr people just … get coherent, finalized ideas out in a single shot without any misspoken words?
reply
transcription gets post-processed by an LLM (with different styles based on prompts, so that it removes fillers, fixes homophones, changes the style, etc.)
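As a rough illustration, the deterministic part of such a cleanup can be sketched in a few lines (the filler list and function name here are invented; real tools hand the whole transcript to an LLM with a style prompt):

```python
# Illustrative post-processing step for raw dictation: drop common
# English filler words before (or instead of) an LLM cleanup pass.
# The filler list and function name are made up for this sketch.
FILLERS = {"um", "uh", "er", "hmm"}

def strip_fillers(transcript: str) -> str:
    kept = []
    for word in transcript.split():
        # Compare without trailing punctuation so "uh," still matches.
        if word.strip(",.").lower() not in FILLERS:
            kept.append(word)
    return " ".join(kept)

print(strip_fillers("um so the meeting is, uh, at noon"))  # → "so the meeting is, at noon"
```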

I recommend the youtube channel @afadingthought to see what people come up with (like v=283-z29TXeM).

reply
They are not.

It's like a hidden curse of LLMs -- they're so good at parsing intended meaning from non-grammatically-correct language that we don't have to be very good at clear communication.

Eventually all LLMs will be controlled by humans uttering terse guttural grunts. We will all become Neanderthals, with machines that deliver our every whim.

reply
You should look into how often people are using rectangles with buttons on them. They may be a bit archaic, but they are my preferred input method. For example, thanks to rectangles with buttons, the other people in my vicinity do not need to hear about the inane internet arguments I routinely involve myself in.

I dunno how I can express this best, but I found out a very long time ago that my problem with voice input wasn't that it wasn't good enough. My problem with voice input is that I don't want it. I am very happy for people who use these tools that they exist. I will not be them. Yes I am sure.

And yes, I know SuperWhisper can run offline, but it is a notable benefit that, unlike many modern speech recognition tools, my keyboard does not require an always-active Internet connection, a subscription payment, or several teraflops of compute power.

I am not a flat-out luddite. I do use LLMs in some capacity, for whatever it is worth. Ethical issues or not, they are useful and probably here to stay. But my God, there are so many ways in which I am very happy to be "left behind".

reply
I'm sorry, but if you think the share of workers using voice controls in the office is more than 1%, you are in a massive bubble, my dude.
reply
It's possible to rely on mouth movements instead of sound. I've been tweaking visual speech recognition (VSR) models for the past few weeks so that I can "talk" to my agents at the office without pissing everyone off. It works okay. Limiting language to "move this" / "clear that" alongside context cues vastly simplifies the problem and makes it far more feasible on device.

I think it's brilliant UX.
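The constrained vocabulary plus a context cue boils down to a trivial lookup; a sketch, with all names invented:

```python
# Hypothetical sketch: a tiny command vocabulary ("move this",
# "clear that") resolved against a context cue, such as whatever
# the pointer is currently over. All names are illustrative.
COMMANDS = {"move this": "MOVE", "clear that": "CLEAR"}

def interpret(phrase: str, pointer_target: str):
    action = COMMANDS.get(phrase.lower().strip())
    return (action, pointer_target) if action else None

print(interpret("Move this", "crab_sprite"))  # → ('MOVE', 'crab_sprite')
```

The point is that the recognizer only has to distinguish a handful of phrases, while the "which object?" question is answered by the pointer, not by speech.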

reply
No UX needs to be perfect for everyone, but this doesn't sound trivial to make reliable.

First things that came to mind:

  - facial hair
  - getting people to learn to make bigger mouth movements and not mumble
  - we're constantly self-correcting our speech as we hear our voice. This removes the feedback loop.
  - non english languages (god forbid bilingualism)
  - camera angles and head movement
And that's just from thinking about it for 30s. I'm sure there are some really good use cases, but will any research group/company push through for years and years to make it really good even if the response is lukewarm?
reply
>non english languages (god forbid bilingualism)

In my experience, any combination of computers + speech + Danish has, so far without exception, been terrible. Last time I tested ChatGPT, it couldn't understand me at all. I spoke both in my local dialect and as close to Rigsdansk [π] as I could manage. Unusable performance, and in any case I should be able to talk normally, or there's no point. It was about a year ago - it may have improved, but I doubt it. I'm completely done trying to talk to machines.

Pre-emptive kamelåså: https://www.youtube.com/watch?v=s-mOy8VUEBk

[π] https://en.wikipedia.org/wiki/Danish_language#Dialects

reply
Yeah, I'd hate to use this in an open-plan office (which is like 99% of offices these days) and even using it alone at home would feel awkward. I don't really want to talk to the computer despite what 1950s sci-fi books led us to believe.

It's a cool idea for the future when we have reliable EEG headsets or Neuralink or whatever though.

reply
The only place I'd ever talk to a machine is my car. Instead of huge flashy screens that distract and kill thousands of people, maybe they could build a buttons + voice agent system that could actually be useful and durable. I hate having to tap Waze/Maps/etc. every time I go somewhere, or that I cannot comfortably switch to specific songs en route without risking my life...
reply
I connect my iPhone to my car and it requires Siri to be enabled which I can then use to change songs, Google Maps destinations etc. without having to touch anything.

The Siri voice transcription is pretty awful compared to what I've experienced with ChatGPT though and it's weird going back almost to the pre-LLM world where you have to give such clear sort of computer-coded voice commands.

reply
>Anything with voice controls for routine use is a pretty tough sell. Doing this when you're not completely alone would be annoying to everyone around you.

Reads like the argument against cell phones when you don't have a cabinet around you...

reply
I wouldn't sit in the office talking on my phone next to my colleagues, that would be really annoying.

I'd go and find a small meeting room or conference call booth in the office and take it there.

Essentially, a cabinet.

reply
The argument is against human to machine control. Not human to human communication.

In fact, when humans do give orders to other humans, it's typically done in writing.

reply
Yes, it does seem kinda ... pointless.
reply
A General-Purpose Bubble Cursor

https://www.youtube.com/watch?v=46EopD_2K_4

>We present a general-purpose implementation of Grossman and Balakrishnan's Bubble Cursor, the fastest general pointing facilitation technique in the literature. Our implementation functions with any application on the Windows 7 desktop. Our implementation functions across this infinite range of applications by analyzing pixels and by leveraging human corrections when it fails.

Transcript:

>We present the general purpose implementation of the bubble cursor. The bubble cursor is an area cursor that expands to ensure that the nearest target is always selected. Our implementation functions on the Windows 7 desktop and any application for that platform. The bubble cursor was invented in 2005 by Grossman and Balakrishnan. However a general purpose implementation of this cursor one that works with any application on a desktop has not been deployed or evaluated. In fact the bubble cursor is representative of a large body of target aware techniques that remain difficult to deploy in practice. This is because techniques like the bubble cursor require knowledge of the locations and sizes of targets in an interface. [...]

https://www.dgp.toronto.edu/~ravin/papers/chi2005_bubblecurs...

>The Bubble Cursor: Enhancing Target Acquisition by Dynamic Resizing of the Cursor’s Activation Area

>Tovi Grossman, Ravin Balakrishnan; Department of Computer Science; University of Toronto
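The core selection rule, pick the nearest target rather than requiring the cursor to be inside it, fits in a few lines (a sketch of the idea, not the paper's implementation; names are illustrative):

```python
import math

# Sketch of the bubble-cursor selection rule: whichever target is
# nearest to the cursor gets selected, so the cursor's effective
# activation area expands to fill the empty space between targets.
def nearest_target(cursor, targets):
    """cursor: (x, y); targets: mapping of name -> (x, y) center."""
    return min(targets, key=lambda name: math.dist(cursor, targets[name]))

print(nearest_target((10, 10), {"ok_button": (12, 11), "cancel": (100, 90)}))  # → ok_button
```

The hard part the video describes is not this rule but discovering the targets' locations and sizes in arbitrary applications, which their implementation does by analyzing pixels.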

I've written more about Morgan Dixon's work on Prefab (pre-LLM pattern recognition, which is much more relevant now with LLMs).

https://news.ycombinator.com/item?id=11520967

https://news.ycombinator.com/item?id=14182061

https://news.ycombinator.com/item?id=18797818

https://news.ycombinator.com/item?id=29105919

reply
Right — it does seem cool but the voice is patching over a major gap. If I'm talking already, why wouldn't I just describe what I'm looking at and have the AI grab it for me?
reply
Pull up any moderately busy picture with more than a trivial number of objects; pictures of "traffic" or with similar repetition are great for this demo. Pick one specific object (like a specific tire on one car) in the image and write (or say) out all the words you'd need to specify that exact object. Now take the same image and point at the object with your mouse or circle it with an annotation tool. It's often very, very hard to describe accurately which object you are talking about; you will often resort to vague "location" words anyway, like "on the upper left", that try to define the position in a coarse way that requires careful parsing to understand. Pointing/annotating is massively superior in brevity, clarity, and speed.
reply
I think they answer that question pretty convincingly: because if what you're looking at is already on the screen, it's much easier to point to it and say "that" than to describe it.

(And if it's an abstract entity like a file, it might not even be possible to describe it, short of rattling off the entire file path)

reply
The "Edit an Image" Demo at the bottom is pretty fun. Maybe this is just Google flexing their LLM inference capacity.
reply
That demo was an absolute disaster for me on Firefox on Mac. It just fundamentally didn't work - the voice was way behind my pointer, there were multiple agents speaking over each other saying conflicting things, and it couldn't even move the crab to the bottom right of the image. Embarrassingly bad, I would say!
reply
Yup - what Google is suggesting here will never materialize beyond being a slop feature. People who want these bespoke workflows will build them or seek out specific tools that enable them, not trust some overarching daemon that contextually watches their cursor. I don't trust Google one bit to execute correctly on something like this.
reply
Well, you see, to really, really sell it to the common folk, they need to convince you that chatbots are the "Intelligence". So they are coming up with all sorts of crap, like this one. The TV advertisements for Gemini and co. are indicative of how they see the average user: as an idiot of sorts who needs the shit-device for pretty much anything. Oh, you spilled some water on the countertop? Quick, ask Gemini what to do! You are a 20-something individual home alone? Quick, lie on the couch and ask Gemini if you can really talk to it, omg, it's so exciting! You were on holiday all alone, but in the middle of a really large crowd? Gemini to the rescue: cut those people out and make it look like it was an exclusive spot, just for you, nobody else was there. So this proposal is going in the same direction - probably targeting the average office "idiot".
reply