Talking to Google Assistant is a real “wow, we’re officially in the future” moment for me, often to the point that it makes me wonder: How do voice-activated virtual assistants work? Specifically, how do they understand what someone is asking, then provide a correct, useful and even delightful response? For instance, a few weeks ago, I was playing around with Assistant before getting to my actual question, which was, naturally, food-related. I said, “Hey Google, what’s your favorite food?” Assistant’s answer was swift: “I’m always hungry for knowledge,” it said. As the cherry on top, the written version that appeared as Assistant spoke had a fork and knife emoji at the end of the sentence.
Assistant can respond to so many different types of queries. Whether you’re curious about the biggest mammal in the world or if your favorite ice cream shop is open, chances are Assistant can answer that for you. And the team that works on Assistant is constantly thinking about how to make its responses better, faster and more helpful than ever. To learn more, I spoke with Distinguished Scientist Françoise Beaufays, an engineer and researcher on Google’s speech team, for a primer on how Assistant understands voice queries and then delivers satisfying (and often charming) answers.
Françoise, what exactly do you do at Google?
I lead the speech recognition team at Google. My job is to build speech recognition systems for all the products at Google that are powered by voice. The work my team does allows Assistant to hear its users, try to understand what its users want and then take action. It also lets us write captions on YouTube videos and in Meet as people speak and allows users to dictate text messages to their friends and family. Speech recognition technology is behind all of those experiences.
Why is it so key for speech recognition to work as well as possible with Assistant?
Assistant is based on understanding what someone said and then taking action based on that understanding. It’s so critical that the interaction is very smooth. You only decide to do something by voice that you could do with your fingers if it provides a benefit. If you speak to a machine, and you’re not confident it can understand you quickly, the delight disappears.
So how does the machine understand what you’re asking? How did it learn to recognize spoken words in the first place?
Everything in speech recognition is machine learning. Machine learning is a type of technology where an algorithm is used to help a “model” learn from data. The way we build a speech recognition system is not by writing rules like: If someone is speaking and makes a sound “k” that lasts 10 to 30 milliseconds and then a sound “a” that lasts 50 to 80 milliseconds, maybe the person is about to say “cat.” Machine learning is more intelligent than that. So, instead, we would present a bunch of audio snippets to the model and tell the model, here, someone said, “This cat is happy.” Here, someone said, “That dog is tired.” Progressively, the model will learn the difference. And it will also understand variations of the original snippets, like “This cat is tired” or “This dog is not happy,” no matter who says it.
The models we use nowadays in Assistant to do this are deep neural networks.
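To make the contrast between hand-written rules and learning from examples concrete, here is a minimal sketch. It assumes each audio snippet has already been reduced to two made-up numbers, and it uses a simple nearest-average classifier as a stand-in; it is not the deep neural network Assistant uses, and every name and value here is hypothetical.

```python
def train(examples):
    """Learn one average feature vector per label from labeled examples."""
    sums, counts = {}, {}
    for features, label in examples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def predict(model, features):
    """Return the label whose learned average is closest to the new input."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(centroid, features))
    return min(model, key=lambda label: squared_distance(model[label]))


# Hypothetical training data: (made-up audio features, what was said).
examples = [
    ([0.9, 0.1], "cat"),
    ([0.8, 0.2], "cat"),
    ([0.1, 0.9], "dog"),
    ([0.2, 0.7], "dog"),
]
model = train(examples)

# An input the model never saw during training is still classified.
guess = predict(model, [0.7, 0.3])
```

No rule about the sound "k" or its duration appears anywhere in the code. The behavior comes entirely from the labeled examples, so adding more examples changes the model without anyone rewriting rules.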
What’s a deep neural network?
It’s a kind of model inspired by how the human brain works. Your brain uses neurons to share information and cause the rest of your body to act. In artificial neural networks, the “neurons” are what we call computational units, or bits of code that communicate with each other. These computational units are grouped into layers. These layers can stack on top of each other to create more complex possibilities for understanding and action. You end up with these “neural networks” that can get big and involved — hence, deep neural networks.
For Assistant, a deep neural network can receive an input, like the audio of someone speaking, and process that information across a stack of layers to turn it into text. This is what we call “speech recognition.” Then, the text is processed by another stack of layers to parse it into pieces of information that help the Assistant understand what you need and help you by displaying a result or taking an action on your behalf. This is what we call “natural language processing.”
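The idea of computational units grouped into stacked layers can be sketched in a few lines. This is an illustrative toy with hand-picked weights and a tanh activation, both of which are assumptions made for the example; a production speech model is far larger and learns its weights from data.

```python
import math


def dense_layer(inputs, weights, biases):
    """One layer: each unit takes a weighted sum of all inputs, then applies tanh."""
    return [
        math.tanh(sum(w * x for w, x in zip(unit_weights, inputs)) + bias)
        for unit_weights, bias in zip(weights, biases)
    ]


def forward(inputs, layers):
    """Pass the input through the stack; each layer's output feeds the next layer."""
    activations = inputs
    for weights, biases in layers:
        activations = dense_layer(activations, weights, biases)
    return activations


# Hypothetical weights: 3 inputs -> 2 hidden units -> 1 output unit.
toy_layers = [
    ([[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]], [0.0, 0.1]),
    ([[1.0, -1.0]], [0.0]),
]

output = forward([0.2, 0.4, 0.6], toy_layers)
```

Adding more entries to `toy_layers` makes the stack deeper, which is the sense of "deep" in the answer above.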
Got it. Let’s say I ask Assistant something pretty straightforward, like, “Hey Google, where’s the closest dog park?” — how would Assistant understand what I’m saying and respond to my query?
The first step is for Assistant to process that “Hey Google” and realize, “Ah, it looks like this person is now speaking to me and wants something from me.”
Assistant picks up the rest of the audio, processes the question and gets text out of it. As it does that, it tries to understand what your sentence is about. What type of intention do you have?
To determine this, Assistant will parse the text of your question with another neural network that tries to identify the semantics, i.e. the meaning, of your question.
In this case, it will figure out that it’s a question it needs to search for, not a request to turn on your lights or anything like that. And since this is a location-based question, if your settings allow it, Assistant can send the geographic data of your device to Google Maps, which returns the dog parks near you.
Then Assistant will sort its possible answers based on things like how sure it is that it understood you correctly and how relevant its various potential answers are. It will decide on the best answer, then provide it in the appropriate format for your device. It might be just a speaker, in which case it can give you spoken information. If you have a display in front of you, it could show you a map with walking directions.
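The steps described in this answer (noticing the hotword, working out the intent, ranking candidate answers and choosing a format for the device) can be sketched as a small pipeline. Everything below is a simplified assumption: it starts from text rather than audio, keyword checks stand in for the neural networks, and the scoring formula, names and sample values are hypothetical.

```python
from dataclasses import dataclass

HOTWORD = "hey google"


@dataclass
class Candidate:
    answer: str
    understanding_confidence: float  # how sure the system is that it understood you
    relevance: float                 # how relevant this answer is to the request


def strip_hotword(transcript):
    """Return the query that follows the hotword, or None if the hotword is absent."""
    cleaned = transcript.strip()
    if not cleaned.lower().startswith(HOTWORD):
        return None
    return cleaned[len(HOTWORD):].lstrip(" ,")


def classify_intent(query):
    """Keyword stand-in for the network that identifies the meaning of the query."""
    lowered = query.lower()
    if lowered.startswith(("turn on", "turn off")):
        return "device_control"
    if any(word in lowered for word in ("where", "closest", "near")):
        return "local_search"
    return "web_search"


def best_candidate(candidates):
    """Rank by an assumed score: understanding confidence times relevance."""
    return max(candidates, key=lambda c: c.understanding_confidence * c.relevance)


def render(candidate, has_display):
    """Pick an output format that suits the device."""
    prefix = "[map]" if has_display else "[speech]"
    return f"{prefix} {candidate.answer}"


query = strip_hotword("Hey Google, where's the closest dog park?")
intent = classify_intent(query)
choice = best_candidate([
    Candidate("Riverside Dog Park, 0.4 miles away", 0.8, 0.9),  # made-up result
    Candidate("Dog parks: a history", 0.9, 0.5),                 # made-up result
])
reply = render(choice, has_display=False)
```

Keeping each stage as its own function mirrors the description above, where a separate step handles hearing, understanding, ranking and presentation.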
To make it a little more complicated: If I were to ask something a bit more ambiguous, like, “Hey Google, what is the most popular dog?” — how would it know if I meant dog breed, dog name or the most popular famous dog?
In the first example, Assistant has to understand that you’re looking for a location (“where is”) and what you’re looking for (“a dog park”), so it makes sense to use Maps to help. In this second example, Assistant would recognize it’s a more open-ended question and call upon Search instead. What this really comes down to is identifying the best interpretation. One thing that is helpful is that Assistant can rank how satisfied previous users were with similar responses to similar questions, which can help it decide how certain it is of its interpretation. Ultimately, that question would go to Search, and the results would be proposed to you with whatever formatting is best for your device.
It’s also worth noting that there’s a group within the Assistant team that works on developing its personality, including by writing answers to common get-to-know-you questions like the one you posed about Assistant’s favorite food.
One other thing I’ve been wondering about is multi-language queries. If someone asks a question that has bits and bobs of different languages, how does Assistant understand them?
This is definitely more complicated. Roughly half of the world speaks more than one language. I’m a good example of this. I’m Belgian, and my husband is Italian. At home with my family, I speak Italian. But if I’m with just my kids, I may speak to them in French. At work, I speak English. I don’t mind speaking English to my Assistant, even when I’m home. But I wouldn’t speak to my husband in English because our language is Italian. Those are the kinds of conventions established in multilingual families.
The simplest way of tackling a case where the person speaks two languages is for Assistant to listen to a little bit of what they say and try to recognize which language they’re speaking. Assistant can do this using different models, each dedicated to understanding one specific language. Another way to do it is to train a model that can understand many languages at the same time. That’s the technology we’re developing. In many cases, people switch from one language to the other within the same sentence. Having a single model that understands what those languages are is a great solution to that — it can pick up whatever comes to it.
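The first approach described here (listen to a little of the utterance, guess the language, then hand off to a model dedicated to that language) can be sketched as follows. This is a toy under loud assumptions: it works on text rather than audio, the vocabularies are tiny and hypothetical, and the per-language "model" is just a labeled placeholder.

```python
# Tiny hypothetical vocabularies; a real language identifier works on audio.
VOCABULARY = {
    "en": {"the", "is", "where", "dog", "park", "what"},
    "fr": {"le", "est", "où", "chien", "parc", "quel"},
    "it": {"il", "è", "dove", "cane", "parco", "quale"},
}


def identify_language(words):
    """Pick the language whose vocabulary matches the most words."""
    scores = {
        language: sum(word in vocab for word in words)
        for language, vocab in VOCABULARY.items()
    }
    return max(scores, key=scores.get)


def recognize(utterance, prefix_len=3):
    """Guess the language from the first few words, then route to that model."""
    words = utterance.lower().split()
    language = identify_language(words[:prefix_len])
    # Placeholder for a dedicated single-language recognizer.
    return language, f"<{language} model> {utterance}"


italian = recognize("dove è il parco")
french = recognize("où est le parc")
mixed = recognize("where is the parco per il cane")
```

The mixed sentence is routed to the English model because only its opening words are inspected, so the Italian half would be handled by the wrong recognizer. That limitation is the reason a single model covering many languages is attractive when people switch languages mid-sentence.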