How to talk to your robot
In the latest Keynote from Google, Sundar Pichai played several phone calls made by Google Assistant to small businesses to make appointments. The ability of the tech (called Google Duplex) to accurately assess the open-ended language on the other end of the line is remarkable. Even more astonishing is the way the assistant speaks: a few ums and ahs, pauses when apparently thinking and a “gotcha” thrown in for good measure. Those answering the calls do not seem to suspect for a moment that they are not speaking to another human.
The capability of technology to correctly interpret and respond to diverse, open-ended human statements is clearly a remarkable achievement. It will unlock a vast array of voice-enabled technological solutions. Sounding like a human, on the other hand, opens a much bigger can of worms. That goal should be viewed neither as inevitable nor as inherently ideal.
Reading about artificial intelligence and smart assistants, we frequently encounter the assumption that these technologies will inevitably become more and more human-like. A futurist proposes that AI assistants will become our friends, emotionally attuned and emotionally responsive, and furthermore that it is just a matter of technological development. A journalist makes an attempt at such a friendship as an experiment. At a minimum, it is assumed that devices will sound and seem more and more like human beings.
Putting aside the question of technological capability, there is reason to doubt the widespread desire for machines that create the impression that a human mind or being is on the other side of the interaction.
Studying the ways people interact with voice devices on a recent project, we learned that many users actually appreciated some of the un-human aspects of their devices.
Robotic speech, mispronunciations, repetition of identical questions, failure to recognize humour or sarcasm: these signs of mechanical, un-human underpinnings are indeed faults insofar as they prevent the successful completion of a task. But if they do not inhibit the immediate goal of the user, they can give users peace of mind and reassurance.
In fact, these supposed shortcomings remind users that there is no human on the other side of the interaction. Instead it is a tool they can use and manipulate to their own ends without guilt. On the other hand, when the devices express too much familiarity or human-ness, they become creepy and invasive.
While robotic lapses can be frustrating, it is the limits of the technology that remind users what really lies behind the nondescript voice and smooth plastic of the devices.
We heard this appreciation for robotic expressions voiced by device owners in the US and UK but not in China. The Chinese users we spoke with were more likely to view technological advances in general with unambiguous optimism and excitement. Ground-breaking gadgets were taken for granted as something fun and helpful, as symbols of success and economic growth.
In the West, feelings about artificial intelligence tend to lie at the extremes. Either it will create a utopia of abundance without work or it will destroy society, with nothing in the middle. Such extreme views are deeply rooted in Western popular culture. They can be seen at least as far back as Shelley’s Frankenstein, even earlier if we consider beings like the golem or the alchemists’ homunculi. Futurists have both feared and celebrated the prospect of new technologies replacing human work for at least 500 years.
Nearly all AI assistants today have an associated persona. Devices and their makers have come under scrutiny for the fact that the majority have feminine names and default to female voices. While motives for pushing a persona can be myriad, two are likely significant. Firstly, developers may believe that personas will make assistants more ‘natural’ and easy to use. Secondly, a persona may advance the impression that the technology has achieved human-level intelligence or something very close to it.
Users do tend to anthropomorphise their smart assistants. They’ll say that Alexa “gets tired” or “she doesn’t like complicated requests that she can’t understand.” No doubt this is influenced by the personas created by the manufacturers as well as depictions of AI in popular culture. But even users who say “please” and “thank you” to the device will acknowledge that they do so out of habit and that it makes no difference to the technology or effective use thereof.
In certain areas, however, Google and Amazon have drawn a personality line that their assistants will not cross. AI devices are quite noticeably devoid of any strong opinions. It is hard to elicit an offensive statement even if you try. The jokes they deliver are perfectly vanilla without exception. In fact, the AIs of today in some ways have less personality than early predecessors like SmarterChild on AIM (now ancient in AI terms).
These cautious limitations are in place because with offensive statements and dirty jokes it becomes abundantly clear: people don’t actually want human-like machines with human-like personalities.
“I speak, therefore I am”
A deep-rooted Western conception may be behind the drive to produce ever-more realistic simulations of human-ness in technology:
‘the idea of voice as guarantor of truth and self-presence, from which springs the familiar idea that the voice expresses self and identity and that agency consists in having a voice’ (Weidman 2014a: 39). Linguistic anthropologist Miyako Inoue (2003: 180) has summarized this idea as ‘I speak, therefore I am.’ The idea here is that the voice is a direct expression of a person’s intimate emotions and opinions, which renders the act of speaking an expression of human agency and, in certain contexts, resistance. (Schäfers 2017)
It follows that where there is a voice, there is a self: thoughts, agency and personality. Those working on smart assistants may be pushing the impression of personality and an underlying self because they believe that is what users will be most comfortable with. But this relationship between self and voice is not a necessary one. Schäfers points to the work of Judith Irvine with the Wolof:
Wolof speakers in Senegal have at their disposal two different ‘registers’ or styles of speaking that are connected to social status and situation. … [T]hese registers or ‘voices’ are not inherent properties of individual speakers but strategically employed in order to mark relative status difference in a particular context.
Registers of speaking can therefore still communicate messages and be consistent in tone without implying an underlying personality. We can imagine a mode of speech which doesn’t represent a type of ‘self’ but instead represents the relation between machines and human interlocutors. As the technology evolves and works its way into more and more of daily life, the speech of machines could include words, syntax or tenses unique to machine-to-human communication. Usted for the user, tú for the device, if you will.
It may not just be the machines that speak a new register either. Humans could also develop distinct ways of speaking to machines. There are signs of this already starting to take place. The most obvious is the use of triggers like “Hey Google” to initiate an interaction. We can hear trigger words in human-to-human interaction (“Jen, I was wondering…”) but they are far less frequent than when a user is speaking to a device.
The trigger action presents an initial application for new human-to-machine linguistics. Echo devices are constantly triggered unintentionally by sounds similar to “Alexa.” One solution would be to use a non-word sound to trigger an interaction – a more distinct sonic marker. Alternatively, a regular word could be pronounced in a different tone to trigger devices. The ability of voice devices in China to interpret tonal language is a capacity that could be adopted for use with non-tonal languages.
Entirely new dialects and ways of speaking could evolve for human-machine communication. As artificial intelligence becomes more ubiquitous and changes the ways in which we interact with technology, new linguistic modes of communicating with machines aren’t that far-fetched.
Most current writing on smart assistants assumes that the ideal form of human-machine communication is a replication of human-to-human communication. Some users do imagine speaking to their devices this way in the future. But in many cases users don’t want machines to seem like humans simply because they aren’t humans.
It is very human for language to reflect the social categories of the speaker and the spoken-to. This requires a new argot in the case of machines, which have only just begun to play these roles.
In the long run, developers shouldn’t worry too much about making devices speak like humans. Having machine speech which is distinct from human speech is actually a good strategy and not evidence of defeat. It is important, however, to work towards distinct machine-human communication which makes the interaction easier and which reflects the categories and expectations of the user.