In 1996, I picked up the phone on my desk, dialed a three-digit code, and said "I need to fly to Los Angeles on Tuesday morning, returning Wednesday evening." A couple of hours later, an envelope appeared in my inbox with plane tickets, a rental car reservation, and a hotel reservation.
Then every company in the world fired all the secretaries over the course of the next few years to cut costs, and we’ve collectively forgotten that it was ever like that.
1. How much can you spend on this trip?
2. Is first/business class necessary?
3. Is a layover acceptable if it's cheaper?
   3a. Is it better to have a 4am flight nonstop or a 7am flight with a layover?
4. Are there preferred airlines?
5. Are there preferred hotel chains? What's the hotel budget? Do you want to pay extra for a nice view?
6. What kind of car should you rent? Is there equipment you'll be handling?
etc...
This is the kind of stuff that's easy(-ish) to communicate by presenting a list of options to a user through an actual interface. It sucks doing it through voice; think of the old phone systems where you had to go through droning "If you would like to rent an SUV, press 1. If you would like to rent a sedan, press 2. To speak to an operator, press 0."
So no, you never had a voice interface for booking flights; you had a human brain to whom you delegated, which is very different.
Ah, so that is indeed the endgame of what I've been seeing, hmm?
With a mouse and keyboard I can switch windows.
With my voice, the computer can't yet automatically determine whether I am dictating a transcription or giving editing commands. What I really need is for the interpreter listening to me to intuitively know whether I am in the equivalent of vi's command mode or insert mode.
This is the roadblock to not needing a screen at all. Right now I want to visualize whether it understood me correctly, because if it didn't switch from insert to command automatically, I now have all my commands written into my paragraph. I also don't want to listen to the computer talk back to me to confirm it listened. I want to just keep going, to keep narrating my thoughts and trust it's doing the right things, without having to check. Having it slowly chime in to confirm it heard me derails my flow and train of thought.
TLDR The future of voice is headless vi.
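To make the vi analogy concrete, here is a minimal Python sketch of a mode-aware dictation loop. Everything in it is hypothetical (the "command mode"/"insert mode" trigger phrases, the command names); it assumes an explicit spoken keyword toggles modes, precisely because today's recognizers can't infer the switch on their own:

```python
# Hypothetical sketch: a dictation loop with vi-style modes.
# An utterance is either appended verbatim (insert mode) or
# interpreted as an editing command (command mode).

COMMANDS = {"delete last sentence", "new paragraph", "undo"}

def process(utterances):
    mode = "insert"
    buffer = []   # dictated sentences
    log = []      # commands that were executed
    for u in utterances:
        if u == "command mode":
            mode = "command"
        elif u == "insert mode":
            mode = "insert"
        elif mode == "command":
            if u in COMMANDS:
                log.append(f"ran: {u}")
                if u == "delete last sentence" and buffer:
                    buffer.pop()
            else:
                log.append(f"unknown command: {u}")
        else:
            buffer.append(u)  # dictation: goes straight into the text
    return " ".join(buffer), log

text, log = process([
    "the meeting went well",
    "command mode",
    "delete last sentence",
    "insert mode",
    "the meeting went very well",
])
print(text)
```

The failure mode described above is exactly what happens if the recognizer misses "command mode": "delete last sentence" lands in your paragraph as text instead of deleting anything, and without a screen you'd never know.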
It can only ever be a linear sequence of input.
The two-dimensional field of a screen, mouse, and keyboard gives you extreme amounts of input and allows you to contextualize that input in arbitrary ways that intuitively make sense to people with minimal training. Most people do not need to be taught that "Paste" goes to the active window.
We barely even scratch the surface of what is possible through this set of input and output devices, and yet we can't even get that level of fine-grained and reliable control onto touch-screen devices and gamepads, let alone a linear stream of pitch.
Voice cannot be a robust interface. It isn't even robust between humans. There's immense nonverbal communication, and human communication also relies heavily on preshared context to actually get that info across in the first place. Even with all that machinery, human speech is generally considered to carry, regardless of language, only 44-ish bits per second of data.