I vividly remember the month that I first got Siri on my iPhone, not least because I’ve barely used it since the initial novelty wore off. Now The Economist is getting excited about voice technology, and even has a special report about Siri, Alexa, and the future of natural language voice computing. For my part, I’m deeply unimpressed not just with voice input technologies, but with the whole idea that natural conversation should be any kind of exemplar for interaction with computers (except, of course, for people who rely on voice input for accessibility reasons, or for the increasingly rare newcomers to IT). Here are some quick reasons why:
(1) Our interactions with computers tend to be highly dynamic, relying on constant feedback. What I mean is that when I’m engaging with a computer, I typically ask lots of quick, bad questions and use the answers to guide me. I’d prefer to look quickly in three possible locations for a file rather than ask, “Siri, can you look in Downloads, or possibly Documents, or possibly Desktop, for a file that could be called Latest Draft or Friday Draft or January Draft? I’ll know it when I see it, so just list all the possible contenders”. The ability to interact quickly with computers and maintain a steady flow of information between user and interface seems like an advantage of using machines over speaking to humans, which tends to consist of relatively long spoken messages bouncing back and forth.
(2) A lot of the formal syntax we use when interacting with computers is very clear and powerful compared to natural language. If I google ‘site:www.economist.com “cecil rhodes” -oxford’, I’ve succinctly asked Google to show me pages on economist.com that mention Cecil Rhodes but make no mention of Oxford. Once you get into more complex searches and operations, the advantages of formal operators become even starker. Of course, you could integrate a similar syntax into Siri or Alexa, but that would constitute a different (albeit compatible) vision of how voice computing is going to go.
(3) Voice input is likely to remain relatively unreliable compared to text input. If I can type “indonesia population” and get an immediate answer, why would I bother asking Siri out loud? Even if Siri is great at recognizing my accent through my heavy cold, I could cough or sneeze or mispronounce a word or simply get distracted for a second, and waste a valuable six seconds of my time.
(4) Finally, on dictation specifically: while I understand that many people hate typing, I find it massively surpasses speaking as a way of composing stuff that is *intended to be read*. As I write, I see how stuff will look to the reader; the phrase that sounds just fine with the benefit of vocal nuance and intonation might look ham-fisted and obscure as words on a page. The joke that comes across brilliantly in speech may fall flat in text. Of course, you can read back to yourself what’s on the page as you dictate it, but given how different the pragmatics of speech and text are to begin with, I find it far better to skip the spoken element altogether and frame the process of composition as a dialogue between writer and reader, rather than between speaker and reader.