Getting voice assistants to speak Slovak first means getting AI to learn languages better.

Apple’s most recent fall event centered on excitement about the iPhone X, face recognition replacing Touch ID, OLED displays, and a cellular-enabled Apple Watch. But instead of “one more thing,” people living in Poland, Lithuania, Slovakia, the Czech Republic, and many other places all over the world certainly noticed one missing thing.

Siri learned no new languages, and it’s kind of a big deal.

A touch screen works splendidly as an interface for a smartphone, but on the tiny display of a smartwatch it becomes a nuisance. And the smart speakers that Apple wants to ship by the end of the year will have no screens at all. Siri—and other virtual assistants like Google Assistant, Cortana, and Bixby—is increasingly becoming the primary way we interact with our gadgets. And talking to an object in a foreign language, in your own home in your own country, just to make it play a song feels odd.
Believe me, I tried. Today, Siri supports just 21 languages.

A quick glance at the Ethnologue reveals there are more than seven thousand languages spoken in the world today. The 21 that Siri has managed to master account for roughly half of the Earth’s population. Adding new languages is subject to hopelessly diminishing returns, as companies need to go through expensive and elaborate development processes catering to smaller and smaller groups of people. Poland’s population stands at 38 million. The Czech Republic has 10.5 million, and Slovakia just 5.4 million souls. Adding Slovak to Siri or any other virtual assistant takes just as much effort and money as teaching it Spanish, only instead of 437 million native Spanish speakers, you reach just 5.4 million Slovaks.

While details vary from Siri to Cortana to Google et al, the process of teaching these assistants new languages looks more or less the same across the board. That’s because it’s determined by how a virtual assistant works, specifically how it processes language.

So if Siri doesn't talk to you in your mother tongue right now, you’re probably going to have to wait for the technology driving her to make a leap. Luckily, the first signs of such an evolution have arrived.

Step one: Make them listen

“In recognizing speech you have to deal with a huge number of variations: accents, background noise, volume. So, recognizing speech is actually much harder than generating it,” says Andrew Gibiansky, a computational linguistics researcher at Baidu. Despite that difficulty, Gibiansky points out that research in speech recognition is more advanced today than speech generation.

The fundamental challenge of speech recognition has always been translating sound into characters. When you talk to your device, your voice is captured as a waveform, a representation of how the amplitude of the sound changes over time. One of the first methods to solve this was to align parts of waveforms with corresponding characters. It worked awfully, because we all speak differently, with different voices. And even building systems dedicated to understanding just one person didn’t cut it, because people can say the same word differently, changing tempo, for instance. If a single term is spoken slowly one time and quickly the next, the input signal can be long or quite short, but in both cases it has to translate into the same set of characters.

When computer scientists determined that mapping sound onto characters directly wasn’t the best idea, they moved on to mapping parts of waveforms onto phonemes, the signs linguists use to represent sounds. This amounted to building an acoustic model, and the resulting phonemes then went into a language model that translated those sounds into written words. What emerged was the standard scheme of an Automatic Speech Recognition (ASR) system: a signal-processing unit that smooths the input sound a little, transforms waveforms into spectrograms, and chops them into roughly 20-millisecond-long pieces; an acoustic model that translates those pieces into phonemes; and a language model whose job is to turn those phonemes into text.
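To make that pipeline concrete, here is a minimal Python sketch of the signal-processing step: chopping a waveform into roughly 20-millisecond frames and turning each frame into a spectrum, the columns of a spectrogram. The sample rate, frame length, and windowing choice are illustrative assumptions, not any particular vendor’s settings.

```python
# Minimal sketch of the signal-processing unit described above: chop a raw
# waveform into ~20 ms frames and turn each one into a spectrum (one column
# of a spectrogram). Sample rate and frame length are illustrative assumptions.
import numpy as np

SAMPLE_RATE = 16_000                      # samples per second (assumed)
FRAME_LEN = int(0.020 * SAMPLE_RATE)      # 20 ms -> 320 samples per frame

def spectrogram(waveform: np.ndarray) -> np.ndarray:
    """Return an array of shape (num_frames, FRAME_LEN // 2 + 1)."""
    num_frames = len(waveform) // FRAME_LEN
    frames = waveform[: num_frames * FRAME_LEN].reshape(num_frames, FRAME_LEN)
    frames = frames * np.hanning(FRAME_LEN)          # smooth the frame edges
    return np.abs(np.fft.rfft(frames, axis=1))       # magnitude spectrum per frame

# One second of fake audio; in a real ASR system this comes from the microphone.
spec = spectrogram(np.random.randn(SAMPLE_RATE))
print(spec.shape)   # (50, 161): fifty 20 ms slices, each a vector of frequencies
```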

“In the old days, translation systems and speech to text systems were designed around the same tools—Hidden Markov Models,” says Joe Dumoulin, chief technology innovation officer at Next IT, a company that designed virtual assistants for the US Army, Amtrak, and Intel, among others.

What HMMs do is calculate probabilities, which are statistical representations of how multiple elements interact with each other in complex systems like languages. Take a vast corpus of human-translated text—like the proceedings of the European Parliament, available in all EU member states’ languages—unleash an HMM on it to establish how probable it is for various combinations of words to occur given a particular input phrase, and you’ll end up with a more or less workable translation system. The idea was to pull off the same trick with transcribing speech.

It becomes clear when you look at it from the right perspective. Think of pieces of sound as one language and phonemes as another; then do the same with phonemes and written words. Because HMMs worked fairly well in machine translation, they were a natural choice for moving between the steps of speech recognition.
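As a toy illustration (not a real HMM), the sketch below captures the recipe: score each candidate written sequence by multiplying an acoustic probability by a language-model probability, then keep the winner. Every candidate and every number in it is made up.

```python
# Toy illustration of the probabilistic recipe described above: given one
# "language" (phonemes), score candidate sequences in the other "language"
# (written words) and keep the most probable one. All numbers are invented.
candidates = {
    # candidate transcript: (P(phonemes | words), P(words) from a language model)
    ("I", "scream"):   (0.50, 0.010),
    ("ice", "cream"):  (0.50, 0.060),
    ("eye", "scream"): (0.45, 0.001),
}

def score(entry):
    acoustic_prob, language_prob = entry[1]
    return acoustic_prob * language_prob

best, _ = max(candidates.items(), key=score)
print(" ".join(best))   # "ice cream" wins once the language model weighs in
```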

With huge models and vast vocabularies, speech recognition tools fielded by IT giants like Google or Nuance brought the word error rate down to more or less 20 percent over time. But they had one important flaw: they were the result of years of meticulous human fine-tuning. Getting to this level of accuracy in a new language meant starting almost from scratch with teams of engineers, computer scientists, and linguists. It was devilishly expensive, hence only the most popular languages were supported. A breakthrough came in 2015.

Step two: Use the deep-learning revolution

In 2015, all of a sudden, Google’s speech recognition system surprised the world with an astounding 49-percent performance jump. How could this system so rapidly go from a 20-percent error rate to a five-percent error rate? Deep learning had really kicked in.

Deep neural networks—algorithms loosely modeled on the human brain—tapped into both big data and powerful hardware. Of the three traditional ASR modules outlined above, DNNs replaced the most challenging and work-intensive: acoustic modeling. Predicting phonemes was no longer necessary. Instead, ASR systems could go straight to characters from raw spectrogram frames, so long as the system first ingested hundreds of thousands of hours of recorded speech. (This is why a dictation service usually precedes a virtual assistant—dictation is where the big data for DNNs eventually comes from, allowing real, self-improving acoustic models to take shape.) Companies needed little to no human supervision to do that, and the systems improved over time.
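In rough outline, such a system can be sketched as follows. The PyTorch snippet below is an assumed architecture, not Apple’s or Google’s actual model: a recurrent network turns 20-millisecond spectrogram frames into per-frame character probabilities and is trained with a CTC loss, which works out the alignment between frames and characters on its own, no phoneme labels required.

```python
# A minimal sketch (assumed architecture): a recurrent network maps 20 ms
# spectrogram frames straight to character probabilities, trained with the
# CTC loss so no phoneme alignment is ever needed.
import torch
import torch.nn as nn

class FramesToChars(nn.Module):
    def __init__(self, n_mels=80, hidden=256, n_chars=29):  # 26 letters + space + apostrophe + CTC blank
        super().__init__()
        self.rnn = nn.GRU(n_mels, hidden, num_layers=2, bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, n_chars)

    def forward(self, spectrogram):                  # (batch, frames, n_mels)
        features, _ = self.rnn(spectrogram)
        return self.out(features).log_softmax(dim=-1)  # per-frame character log-probs

model = FramesToChars()
ctc = nn.CTCLoss(blank=0)                     # CTC aligns frames to characters on its own
frames = torch.randn(1, 300, 80)              # ~6 seconds of 20 ms frames (dummy data)
target = torch.randint(1, 29, (1, 40))        # character indices of the transcript
log_probs = model(frames).transpose(0, 1)     # CTCLoss expects (frames, batch, chars)
loss = ctc(log_probs, target, torch.tensor([300]), torch.tensor([40]))
loss.backward()                               # training would repeat this over huge audio corpora
```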

There were minor hiccups. These systems could not predict how to spell a word they hadn't seen before, in most cases proper names or neologisms. But instead of failing completely, they dealt with it in a very humanlike manner: they spelled new words phonetically. And a phonetic transcription was a piece of cake for HMM-based language models, which simply assigned low probabilities to sequences like “try cough ski concerto” and concluded that “Tchaikovsky concerto” was far more likely.

Compare this to what Apple’s Alex Acero told Reuters this spring when he described how Apple began working on Siri learning Shanghainese. First, the company invited native speakers to read passages in a variety of dialects and accents and had computers learn from transcribed samples. One of the problems that surfaced at this stage was that people reading such passages in a studio setting often sounded dull and unemotional—it wasn’t their natural way of talking.

Tech companies use some clever tricks to get around this, like fitting speakers with headphones playing the background sounds of a crowded cafe or shopping mall. To make the speech livelier, engineers fiddle with the passages to be read: having speakers read poetry, good literature, or a movie script leads participants to start voice acting. From there, with sound-editing software, you can add all kinds of noise to the samples: wind, a running car engine, distant music, other people talking. It all helps get the samples as close to real-world data as possible.

If this sounds like a pre-deep-learning way of building ASRs, that’s because it largely is. Apple, still known for its perfectionism, likely proceeds with care, trying to get its systems as fine-tuned as possible before deployment, which means it probably relies on human-supervised transcription more than it strictly needs to. By contrast, Google has recently shown what deep learning can really do for this field. This past August, Google dictation added 21 new languages, extending its support to a staggering total of 119 languages.

Step three: Algorithmic understanding

Transcribing your utterance, no matter the language—which is the whole point of these complicated ASR systems—is merely the first part. A virtual assistant then needs to do something with the result. Such query understanding is usually done in three steps, and the first is called domain classification. To start, an AI essentially tries to figure out what category the requested task falls into. Does it have something to do with messaging, watching movies, answering factual questions, giving directions, etc.?

Which domain an assistant ultimately goes for usually depends on whether it can find some specific keywords, or combinations of keywords, in the text it has been given. When we say something like “play the trailer for the movie starring Johnny Depp and featuring Caribbean pirates,” an assistant will simply calculate how probable it is that, given the input contains words like “movie” and “trailer” and “starring” placed close to each other, she should go for a “movie” domain.

Once the domain is figured out (if it isn’t, you simply get an “I’ll search the Web for that” response), a virtual assistant goes on to intent detection. This comes down to what action you want your assistant to take. Since we’re in “movies,” the presence of the word “play” makes it quite probable we want it to open a video file. The last remaining problem is which one.

To make its guess, Siri resorts to semantic tagging, or slot filling. Let’s say that to find the right trailer we need to fill in slots like “title” or “actor,” maybe “plot” when we can’t remember the title exactly. Here, Siri would simply find it most probable that, the two previous steps considered, Johnny Depp is an actor and that the word “Caribbean” placed right beside “pirates” hints at the latest installment of a popular franchise. All a virtual assistant can do is group such defined intents along with sets of keywords hinting at them. Amazon’s Alexa supports roughly 16,000 of these. Dumoulin’s Next IT has recently released a set of tools for businesses to build their own virtual assistants that contains a staggering 90,000 intents.
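In toy form, the whole three-step chain (domain classification, intent detection, slot filling) can be sketched with nothing more than keyword overlaps. Real assistants use statistical models trained on mountains of examples; the domains, keyword lists, and hard-coded actor slot below are purely illustrative.

```python
# A toy, keyword-based version of the three steps described above. Real systems
# score probabilities over thousands of intents; these lists are illustrative only.
import re

DOMAIN_KEYWORDS = {"movies": {"movie", "trailer", "starring"},
                   "messaging": {"text", "message", "send"}}
INTENT_KEYWORDS = {"movies": {"play_trailer": {"play", "trailer"}},
                   "messaging": {"send_message": {"send", "text"}}}

def understand(utterance: str):
    words = set(re.findall(r"[a-z]+", utterance.lower()))
    # Step 1: domain = the category whose keywords overlap the query the most.
    domain = max(DOMAIN_KEYWORDS, key=lambda d: len(words & DOMAIN_KEYWORDS[d]))
    # Step 2: intent = the action within that domain with the best keyword match.
    intent = max(INTENT_KEYWORDS[domain],
                 key=lambda i: len(words & INTENT_KEYWORDS[domain][i]))
    # Step 3: slot filling; here just a hard-coded lookup for this example query.
    slots = {"actor": "Johnny Depp"} if {"johnny", "depp"} <= words else {}
    return domain, intent, slots

print(understand("play the trailer for the movie starring Johnny Depp"))
# ('movies', 'play_trailer', {'actor': 'Johnny Depp'})
```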

At first glance, it seems like a nightmare to translate all of this when localizing an assistant in another country. Yet, this is not the case. This way of processing input text means virtual assistants' brains are not that much of an issue when it comes to supporting multiple languages. “In translation systems you use the word error rate. You measure the number of deletions and insertions and incorrect translations in the output,” says Dumoulin. “What we do is look at the number of concepts we deleted or inserted in the process. That’s why one language model can work with other languages, even though the translation may not be perfect. As long as an assistant correctly recognizes concepts, it works smoothly.”
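The word error rate Dumoulin contrasts with his concept-level measure is simple enough to sketch: count the insertions, deletions, and substitutions needed to turn the reference transcript into the recognizer’s output, then divide by the number of reference words. The example sentences below are hypothetical.

```python
# Word error rate as edit distance over the number of reference words;
# a sketch of the standard metric, not Next IT's actual tooling.
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dist[i][j] = edits needed to turn the first i reference words
    # into the first j hypothesis words.
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        for j in range(len(hyp) + 1):
            if i == 0 or j == 0:
                dist[i][j] = i + j
            else:
                substitution = dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
                dist[i][j] = min(dist[i - 1][j] + 1,      # deletion
                                 dist[i][j - 1] + 1,      # insertion
                                 substitution)
    return dist[-1][-1] / len(ref)

print(word_error_rate("book me a flight for sunday", "book a flight on sunday"))
# 0.33...: two word-level errors, yet every concept (booking, flight, Sunday) survives
```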

According to Dumoulin, it would even be possible to machine-translate intents and still get reasonably good results. “It’s one of the viable solutions,” he says. At Next IT, the first step in adding a new language is running intents and their appropriate keywords through machine translation. “Then we work with linguists and experts to perfect the translation.” However, this last step is only necessary because Next IT builds assistants to work in specific domains with their own professional jargons. “There’s usually not enough of such domain-specific text documents for machine translation to work reliably, but general-purpose assistants are more generic. Type ‘book me a flight for Sunday’ in Google Translate and it will get it right in every language,” Dumoulin says.

This industry-specific anecdote shows the struggle for machine learning: localizing an assistant, not just translating it, means taking cultural factors into account. It may appear easy—figuring out that Brits call football what Americans call soccer shouldn’t be that hard, after all—but the problem can go deeper than that.

“There is a specific phrase people in Portugal use when they are answering a phone. They say something that means ‘who’s speaking.’ We don’t say anything like that in the US, as this would be considered rude, but over there it’s nothing, something like ‘hello,’” says Dumoulin.

So, a truly conversational AI has to know about such nuances for a given language and culture and realize it’s just a manner of speech, not a literal request. Fishing out such local peculiarities and getting a query-understanding module localized in a new language takes 30 to 90 days, depending on how many intents a virtual assistant needs to cover, according to Dumoulin. The upside here is that since Siri and the other most popular systems can be employed by third-party app developers, the burden of localizing intents lies mostly on companies that want Siri to work with their services in a given language. Apple requires Siri-friendly developers to include keywords and examples of phrases prompting Siri to trigger their apps in all the languages they wish to support. This makes it possible, in a way, to crowdsource the localization.

So, recognizing speech and understanding language are both currently possible (with workable time, resource, and cost commitments) for multiple languages. But that’s not where virtual assistants end—once an assistant is done processing our queries, it has to communicate the results to us. Today, that’s where things go south for less popular languages.

Talk to me

“To generate speech, Siri and other such systems use concatenation models,” says Gibiansky, the Baidu computational linguistics researcher. Concatenation means “stringing together,” and, in speech-generation systems, what’s being strung together are basic sounds of human voice. “Those things are incredibly complex. One way to build them is to invite a collection of experts—linguists to work with a phoneme system, sound engineers to go for signal processing, and lots and lots of other people working at every detail. It’s very complicated, time-consuming, and expensive.”

Gathering a team of experts with such sophisticated skills specialized in English and other widely used languages is well within reach of big tech companies like Apple or Google. But try to find someone who can do the same thing in Polish, Slovak, or Sudanese, and you’ll instantly find yourself in a world of trouble. Yet concatenation models are worth the effort, because they offer the best possible naturalness and intelligibility of synthesized speech.

After hiring voice actors, native speakers of the new language who will lend their voices to the virtual assistant, the first thing to do is build the right script. Let’s take Siri. “There are noticeable variations in quality of Siri’s speech synthesis,” Gibiansky says. “When a given word is present in her database, the voice actor actually did say it during recording sessions, it sounds very natural, the quality is flawless. But when it’s not the case, the system has to concatenate. Concatenation means stringing together such words from basic building blocks of speech—phonemes, diphones, half-phones, and so on. The quality goes down.” So, the choice of a script depends on what an assistant is supposed to do. And for a general-purpose system like Siri, you need to cover a wide range of conversational speech.

Once the voice actors are done recording, you end up with two files. One is a text file with the script they were reading; the other is a speech file containing the audio. At this stage, linguists and other experts in a given language need to go through the speech file and align it with the text file on multiple levels (entire paragraphs, sentences, words, syllables, and phones, all of which become speech units of the file).

The time and effort that goes into this process depends on the quality you aim for. A TTS system working solely with phones is quite simple. There are around 50 phones in English, Hindi, and Polish. Getting all of them right takes an hour of audio or so. But the resulting speech, built with no consideration of how one phone transitions into the other, is downright awful. It sounds as robotic as it can possibly get. To make it more natural you need to go for diphones, units of speech that consist of two connected halves of phones. Suddenly, the number of your speech units grows to anywhere between a thousand and two thousand.
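A small sketch shows why the inventory balloons: moving from phones to diphones means every unit now straddles the transition between two neighboring phones. The phone transcription of “hello” below is rough and purely illustrative.

```python
# Sketch of how the unit inventory grows when a concatenative system moves from
# phones to diphones: each unit spans the second half of one phone and the first
# half of the next. The phone labels here are rough and illustrative.
phones = ["h", "@", "l", "oU"]            # rough phones for "hello" (assumed)

diphones = [f"{a}-{b}" for a, b in zip(["sil"] + phones, phones + ["sil"])]
print(diphones)   # ['sil-h', 'h-@', '@-l', 'l-oU', 'oU-sil']

# With ~50 phones, the possible diphone inventory is on the order of 50 * 50,
# which is why the database jumps from dozens of units to a couple of thousand.
print(50 * 50)    # 2500 possible transitions (many never occur in practice)
```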

The voice at this point becomes better, but it’s still not what you want your more demanding users to hear. That’s why most modern TTS systems rely on triphones: phones with half of the preceding phone attached at the beginning and half of the succeeding phone at the end. But sound engineers and linguists are not done with just a database of triphones. They also need to come up with an elaborate set of prosody rules describing patterns of stress and intonation in a given language. Perfecting the voice these services use to talk back to users “can take several months of hard work,” says Gibiansky. That’s why he and his colleagues at Baidu are working on a way around it—they want deep learning to revolutionize speech synthesis in the same way it revolutionized speech recognition two years ago.

Neural voice

Back in March, a team of researchers at Google led by Yuxuan Wang published a paper on a new TTS system called Tacotron. They claimed it was the world’s first end-to-end TTS system, and by end-to-end they meant you just had to give it pairs of text and speech, and it could learn to speak any language by itself. Tacotron managed to master English with just 21 hours of such transcribed audio. Its design principle could be traced back to another development Google had introduced: sequence-to-sequence neural translation.

To translate a text from one language to another, a neural network takes a sequence of signs in a source language and predicts what the corresponding sequence of signs in a target language should look like. Words are ascribed numerical values and become signs in longer sequences like phrases, sentences, or entire paragraphs. Thus a sentence like “Little Mary wants an ice cream” in English would first be changed into a sequence of signs like “123456,” where “1” stands for “little,” “2” stands for “Mary,” and so on. When translating to Polish, the system would try to guess the corresponding sequence of signs in Polish and probably come up with something like “Mała Mary chce loda,” where “1” stands for “Mała,” “2” stands for “Mary,” “3” stands for “chce,” etc. Neural translation algorithms learn by analyzing huge numbers of pairs of such aligned sequences in the source and target languages. And just as in the old days, once a new technique took hold in machine translation, it began making inroads into speech recognition and generation as well.
The Tacotron team basically treated speech as yet another target language to translate written text into. The beginning of the process looked more or less the same, with one key difference: a sign was no longer a whole word but a single character. (So, “1” stood for “a,” “2” stood for “b,” and so forth.) A single word ceased to be a sign and became a sequence. Think of this as getting to a higher resolution in the algorithm’s understanding of language. Character-level resolution yields better results than coarser, word-level resolution, but it takes more computing power.
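The difference in resolution is easy to see in a few lines of Python. The ID assignments below are arbitrary (any consistent numbering works), but the same six-word sentence becomes a sequence of six signs at the word level and thirty at the character level.

```python
# Sketch of the "signs" idea described above, with made-up ID assignments:
# word-level tokens as in neural translation, character-level tokens as in Tacotron.
sentence = "little mary wants an ice cream"

# Word-level: each word gets a number, so the sentence is a short sequence.
word_ids = {w: i + 1 for i, w in enumerate(dict.fromkeys(sentence.split()))}
print([word_ids[w] for w in sentence.split()])   # [1, 2, 3, 4, 5, 6]

# Character-level: each character gets a number, so the same sentence becomes
# a much longer sequence: higher resolution, more computation.
char_ids = {c: i + 1 for i, c in enumerate(sorted(set(sentence)))}
print([char_ids[c] for c in sentence])           # 30 numbers instead of 6
```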

For speech, the Tacotron team defined a sign as a single spectrogram frame lasting roughly 20 milliseconds. The rest worked just as it does in neural translation; a sequence of signs (characters) in text at the input was translated into a sequence of signs (spectrogram frames) at the output. This system’s learning process worked the same way as well: Tacotron learned by analyzing pairs of such sequences.

The results were fantastic. It was sensitive to punctuation, got stress and intonation surprisingly right, and could figure out how to pronounce words that were not present in its training database. You can hear Tacotron’s voice here—it learned all this after just a few hours of training.

“The exciting thing about deep-learning-based systems is that they really just need data. You can solve the problem with generating speech once, and for all further languages, further voices, you can apply the same exact mechanism,” Gibiansky says. “We can have hundreds of languages and thousands of voices, and the whole thing together can cost a lot less money and effort than just one of the non-neural text-to-speech systems we have today.”

Gibiansky’s team at Baidu, shortly after Google published the Tacotron paper, unveiled its own system called Deep Voice 2 (hear it here). It took this deep learning application even further. “I would say the Google paper described a new neural network system that, given the 20 hours of an actor speaking, could synthesize speech using that actor’s voice. The improvements that we got on that are two-fold,” he tells Ars. “First we improved part of the Tacotron using the WaveNet system, which significantly improves the quality of the audio. But the real goal we were pursuing was to demonstrate that we didn't need 20 hours from a single speaker.”

Instead, Deep Voice 2 can learn to speak with a particular voice using just 20 to 30 minutes of a single person’s recorded speech. All the rest of its training audio can be gathered from multiple speakers. “Each person in our database had only about half an hour of speech. There were over a hundred of them, different voices, different accents, different genders,” says Gibiansky. Once you choose whose voice the system is supposed to mimic, it can learn to speak with that voice by leveraging all the information contained in the remaining speakers’ audio. “It can pronounce a word that has never been said by a person with whose voice it is speaking thanks to commonalities it has learned from other voices,” claims Gibiansky.
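One plausible way to picture the multi-speaker trick, as a rough sketch rather than Baidu’s published Deep Voice 2 architecture: every speaker gets a small learned embedding vector, and that vector is appended to the text encoding at every time step. The bulk of the network is shared across all voices, while the identity of each voice lives in a handful of numbers that half an hour of audio is enough to learn.

```python
# Rough sketch of speaker conditioning (an assumption, not Baidu's actual code):
# a learned per-speaker embedding is concatenated onto every time step of the
# text encoding, so acoustic knowledge is shared across speakers while each
# voice's identity stays in a small, speaker-specific vector.
import torch
import torch.nn as nn

NUM_SPEAKERS, SPEAKER_DIM, TEXT_DIM = 100, 16, 128   # sizes assumed for the example

speaker_table = nn.Embedding(NUM_SPEAKERS, SPEAKER_DIM)

def condition_on_speaker(text_encoding: torch.Tensor, speaker_id: int) -> torch.Tensor:
    """Append the chosen speaker's embedding to each time step of the text encoding."""
    emb = speaker_table(torch.tensor([speaker_id]))       # (1, SPEAKER_DIM)
    emb = emb.expand(text_encoding.shape[0], -1)          # repeat for every time step
    return torch.cat([text_encoding, emb], dim=-1)        # (steps, TEXT_DIM + SPEAKER_DIM)

encoded_text = torch.randn(42, TEXT_DIM)                  # dummy encoder output for one utterance
decoder_input = condition_on_speaker(encoded_text, speaker_id=7)
print(decoder_input.shape)                                # torch.Size([42, 144])
```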

Gibiansky and Baidu see this as opening a world of possibilities—not just voice interfaces and assistants in any language, but using deep-learning speech generation as a means of preserving dead languages or as a tool to let others build highly specific TTS systems. “There would be no longer a need for using teams of experts,” he says. “You can imagine creating thousands of different voices in hundreds of languages on demand. This can be very personalized.”

So while I can’t talk to Siri in my own language today, the blueprints for such an expansion now seem to exist. According to Gibiansky, the state of speech generation is more or less where speech recognition was a few years ago. “We’re two, maybe three years away from getting this technology to production level,” he says. “And once we get there, you’ll see an explosion of text-to-speech systems for all possible languages.”

Perhaps that’s roughly how long half of the world is going to have to wait for Siri to finally talk to them.