The wake word + invocation magic spell required to use assistants has been a usability, adoption, and discoverability problem from the start of conversational tech. We are experts at asking for what we want; we do it many times a day, in many different ways. The wake word + invocation pattern not only ignores these well-developed abilities, it forces us to speak in a way unlike almost anything we do naturally. Even when we do use names and phrases to get someone's attention, we do not normally say things like "Hey Paul, tell Arun to send out the meeting notes" when both Paul and Arun are present and able to pay attention.
In this thought-provoking article, Hilary Hayes encourages us to explore moving beyond these unnatural and cognitively demanding language structures. She posits that cues from common human interactions, such as gaze, could be used for (or inspire) techniques that let assistant hardware recognize when a person is about to speak to it.
By the end of 2020, I anticipate voice design moving away from explicit wake words (removing the need for saying “Alexa” or “Hey Google”) to implicit wake conditions, such as recognizing when a user looks at or turns to face a device housing the assistant.
We think this is an idea worth exploring deeply, along with techniques like direct or wake-level invocation. Imagine simply saying, "Hey ESPN, what's the latest in the NFL?" and having your assistant hear that and route it appropriately. While intermediation made some sense in the early days of voice assistants, technology-based services overall have been moving away from mediated interaction for decades. Voice assistants should follow suit.