Episode 25 of The Informed Life podcast features an interview with linguist Mary Parks. For almost twenty years, Mary has worked as a voice user interface designer for several digital technology companies, including some of the field’s leaders. Our conversation focused on what it takes for digital systems to parse, understand, and generate speech.
One fascinating aspect of voice recognition systems is how they separate the audio signal of an utterance from the content it carries, its “text.” For example, as Mary put it, the system doesn’t know if you’re yelling at it, only what you’re saying. But the audio signal itself also carries a lot of important information:
The moment we open our mouths, a massive amount of identifying information is in the speech utterance, in the first two seconds of the utterance. Whenever we talk, there’s a ton of information there. You hear things in the sound of the voice that tell you who the person is, elements of their identity, including perhaps the region they’re from. You know, there’s just all kinds of things that come up. And if you know the person, then your brain goes, “Oh, I know this voice.” Like you can hear only just to the two seconds of a voice, and if it’s somebody you really know, you’ll know who it is right away with pretty high confidence as a person. And so just identity and language are deeply tied.
I wish Mary and I had talked longer — there was much in our conversation I wanted to follow up on. I hope you get as much value from this episode as I did.
(By the way, in case you missed it: the show is now available on Google Play Music. This should make it easier for folks who use Android devices to listen.)