First, watch this video and laugh:
Writing a Perl program with speech recognition
Cue discussion of how voice-activated interfaces are a terrible idea, that real computers will never work like computers on Star Trek, that Microsoft’s UI designers were consumed with hubris if they ever thought this would work, etc. etc.
Too pessimistic! Thing is, there’s actually a very simple solution to this problem. It doesn’t even rely on any exotic technologies. We could do it today.
Things that a human says within a computer’s hearing might be:
- commands (“Do this”)
- content (“Put this text into my document”)
- noise (things not intended for the computer at all, such as conversations between humans)
The very same words can be any of these three categories, depending on the intent behind them. Human language is ambiguous! That’s why humans rely on so many non-verbal cues, like tone of voice and facial expression, to interpret what other humans are saying.
Computers can’t interpret like that. But they wouldn’t have to if we just had a microphone with a couple of buttons on it.
- Hold button 1 and talk: The software interprets your speech as commands.
- Hold button 2 and talk: The software interprets your speech as content.
- Hold neither button: The software ignores anything you say.
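Here's a minimal sketch of that routing logic, assuming the speech recognizer has already turned audio into text. The names (`Mode`, `mode_for_buttons`, `route_utterance`) are made up for illustration, and treating "both buttons held" as noise is my own assumption, since the scheme above doesn't cover that case.

```python
from enum import Enum


class Mode(Enum):
    COMMAND = "command"
    CONTENT = "content"
    IGNORE = "ignore"


def mode_for_buttons(button1_held: bool, button2_held: bool) -> Mode:
    """Map the state of the two microphone buttons to an interpretation mode."""
    if button1_held and not button2_held:
        return Mode.COMMAND
    if button2_held and not button1_held:
        return Mode.CONTENT
    # Neither button held, or both held (assumption: treat both as noise).
    return Mode.IGNORE


def route_utterance(text, button1_held, button2_held, commands, document):
    """Append recognized text to the command queue or the document, or drop it."""
    mode = mode_for_buttons(button1_held, button2_held)
    if mode is Mode.COMMAND:
        commands.append(text)
    elif mode is Mode.CONTENT:
        document.append(text)
    # Mode.IGNORE: the text is discarded.
    return mode


# Usage: the same words end up in different places depending on the button.
commands, document = [], []
route_utterance("delete that", True, False, commands, document)   # a command
route_utterance("delete that", False, True, commands, document)   # dictated text
route_utterance("delete that", False, False, commands, document)  # ignored
```

The point is that the software never has to guess intent from the words themselves: the button state carries it.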
I’m gonna go out on a limb here and predict that within the next few years we’ll see a decent voice-activated system that relies on a non-verbal communication channel (such as a few buttons) to help resolve the ambiguity of speech.
There will still be plenty of applications where people would rather type than talk (think about all the reasons why people would use text messaging on cell phones instead of calling up and talking to someone), but once the novelty wears off I think speech-based interfaces will soon be seen as one more useful tool among many.
July 18, 2008 at 1:38 am
Your idea makes sense, but the execution is problematic. Voice recognition is most useful when you don’t have the ability to use your hands. Tying them to the microphone kills that benefit. However, there could be some interesting solutions to get you in and out of modes with non-verbal sounds. For example, whistling with an ascending pitch could put you into a mode and a descending pitch could take you out. Clicking your tongue could also be a trigger. Other possible solutions could be speaking in different pitches, speaking in different directions (with stereo microphones), or even visual triggers such as closing an eye to signal different modes.
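The whistle idea could be sketched roughly like this, assuming some pitch tracker (not shown) already supplies a list of frequency estimates in Hz for each whistle. The function names and the 100 Hz threshold are hypothetical, picked only for illustration.

```python
def whistle_direction(pitches_hz, min_change_hz=100.0):
    """Classify a whistle as 'ascending', 'descending', or 'none'.

    Compares the first and last pitch estimates; min_change_hz is an
    arbitrary illustrative threshold, not a tuned value.
    """
    if len(pitches_hz) < 2:
        return "none"
    change = pitches_hz[-1] - pitches_hz[0]
    if change >= min_change_hz:
        return "ascending"
    if change <= -min_change_hz:
        return "descending"
    return "none"


def update_mode(active, pitches_hz):
    """Ascending whistle enters the mode, descending leaves it, else unchanged."""
    direction = whistle_direction(pitches_hz)
    if direction == "ascending":
        return True
    if direction == "descending":
        return False
    return active
```

A real system would need something sturdier than comparing two endpoints, but the mode switch itself stays this simple.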
July 18, 2008 at 9:08 am
But could you imagine clicking, whistling and winking at your computer in an office? People may get the wrong impression about what you’re doing.
In the Star Trek world they got around this by saying “computer” before everything, in pretty much the same way we call someone by name if we want to talk to them. Although quite why Captain Picard used to say “Tea, Earl Grey, Hot” at his replicator all the time is lost on me. If I were him I’d have that stored as a macro so I could just say “tea”.
It’s perhaps best that computers don’t understand what we say. Microsoft Word would have uninstalled itself from my PC years ago if it knew just what I was shouting at it.
July 18, 2008 at 5:05 pm
I think that the recognition software is a bit to blame here. I remember seeing a video with Dragon Naturally Speaking ( http://www.nuance.com/naturallyspeaking/ ) in which the person did the same task in a couple of minutes.
July 21, 2008 at 10:19 am
I think that the key element in the development of this technology will be the new breed of GUI command line tools – the Launchy, Quicksilver and Enso of the world.
Speech is simply text, and currently computers aren’t operated through text (if you are in the command line, you’re probably using a keyboard anyway). Once these technologies are integrated into a GUI /properly/ and have matured, it will be a relatively simple development for them to take their text input from speech rather than the keyboard.
Just as you can shorten commands with Enso, people will develop shorthand ways of speaking. Syllables and words will be shortened, and I can see syllables that are not standard in English being adopted.
July 22, 2008 at 4:24 am
@Mark: “Voice recognition is most useful when you don’t have the ability to use your hands.”
That’s certainly true in some contexts, but it doesn’t have to be a universal rule. A microphone can be just another input device like a mouse, keyboard, or tablet. I would argue that it’s “most useful” when verbal communication is the natural form of interaction for a task. That probably overlaps a lot with tasks where hand usage is a constraint, but they’re different issues.
July 22, 2008 at 8:54 pm
Justin: Yes, indeed. I think that the reason voice recognition is right now used mostly in hands-free environments is that the interface kinda sucks, so it’s mostly for specialized situations.
Consider: a microphone with two buttons takes one hand, and has two buttons. That’s a much smoother interface, in principle, than a keyboard which has how many buttons? It’s just that voice recognition seems to be fixated on getting the interface to function like a full human language listener. It might pay to think more in terms of key phrases (and I’m running into game design principles suddenly) which activate specific commands. The microphone as a mouse-like tool (does a series of simple actions very efficiently), not a keyboard (does a great many actions less efficiently.)
yrs–
–Ben
July 23, 2008 at 1:03 pm
A little bit late, but this came to my mind after reading the post:
What about using the camera that most notebooks have above the monitor? If you look straight at it, this could be detected and used – possibly in combination with a command word – to switch recognition contexts. Combine it with hand gestures as mentioned earlier, and you have multiple options for different situations.
There are some face recognition programs out there, and I guess it could be done without higher wizardry.
Just my 2 cents.