Speech-to-speech translation – have we seen it all?

It is safe to say that automated translation of written text, also known as text-to-text translation, has reached a mature stage. Converting written text into human-like speech, known as text-to-speech synthesis, doesn’t surprise anyone either. But when it comes to speech-to-speech translation, have we already seen it all? Has it already reached its full potential?

There’s a varied array of speech-to-speech translation apps and devices available on the market – from earbuds and smartphone apps to handheld devices resembling TV remotes. They differ in the number of languages they cover, the speed with which they produce translations, and their ability to work with and without an Internet connection.

Taxis in Dubai, in addition to free wi-fi, are now fitted with a translation service that lets customers chat with their driver. Passengers can use the automated interpreting system to have their words translated into the driver’s language and displayed on the meter screen.


Access to translation and interpreting has increased dramatically in the last decade thanks to the rise of mobile phones.

While none of these inventions can quite match the fluency, adequacy and cultural touch of translation and interpreting skills of a human linguist, the field of speech-to-speech translation has certainly attracted the attention of major tech players such as Baidu, Google and Microsoft. This goes to show that this area has significant potential; otherwise, tech giants wouldn’t be willing to invest their resources in exploring it.


Baidu is not only China’s top search engine; like Google and Microsoft, it has been one of the leading proponents of artificial intelligence globally.

In autumn 2018, in a research paper titled “STACL: Simultaneous Translation with Implicit Anticipation and Controllable Latency using Prefix-to-Prefix Framework” Baidu announced that it was working on a speech-to-speech translation system that works “very much like a simultaneous interpreter”.

This audacious claim applied mainly to the speed with which translation is produced.

Since then, Baidu has taken the concept from the STACL research paper one step further and in August this year released a “machine simultaneous interpretation” service.

However, upon closer inspection, it transpired that Baidu used the term “simultaneous” to label its latest invention in a slightly misleading way.

Baidu’s speech-to-speech translation solution uses what researchers call a wait-k model. It allows users to determine how many words the device will wait before beginning translation.

Wait-3 means that STACL starts translating after the speaker has uttered 3 words, wait-4 means it starts after hearing 4 words, and so on. This delay is of crucial importance because STACL predicts the words the speaker might say next based on the words already uttered.

This means that the more delay there is, the more data STACL has for correctly predicting the output that will follow. The wait time therefore has a direct impact on accuracy, and this trade-off undercuts Baidu’s claim that its solution performs interpreting simultaneously.
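The wait-k policy described above can be illustrated with a short sketch. This is a simplified toy model of the scheduling idea, not Baidu’s actual implementation: `translate_step` stands in for a prefix-to-prefix decoding step of the model.

```python
def wait_k_decode(source_stream, translate_step, k=3):
    """Illustrative wait-k decoding loop (a sketch, not STACL's real code).

    source_stream: iterable yielding source words as they are spoken.
    translate_step: hypothetical callable that takes the source prefix seen
        so far and the target prefix emitted so far, and returns the next
        target word (or None when the translation is complete).

    After the first k source words arrive, one target word is emitted per
    new source word; once the source ends, the rest is flushed.
    """
    source_prefix, target = [], []
    for word in source_stream:
        source_prefix.append(word)
        if len(source_prefix) >= k:
            target.append(translate_step(source_prefix, target))
    # Source finished: keep decoding until the model signals the end.
    while (next_word := translate_step(source_prefix, target)) is not None:
        target.append(next_word)
    return target


# Toy demo: a "translator" that simply copies each source word, uppercased.
def copy_step(src, tgt):
    return src[len(tgt)].upper() if len(tgt) < len(src) else None

print(wait_k_decode(iter(["hello", "how", "are", "you"]), copy_step, k=3))
# → ['HELLO', 'HOW', 'ARE', 'YOU']
```

With k=3, nothing is emitted for the first two source words; the first target word appears only once the third source word has arrived, which is exactly the delay-versus-accuracy trade-off discussed above.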

Mobile phone with the Baidu app open on its screen

Baidu hasn’t been the only company guilty of hyping its AI technology. / Editorial credit: Shutterstock.com


In a demo from July 2019, Microsoft showcased its mixed reality and translation technologies, creating the appearance that they can have a hologram of a speaker give a speech in a language the speaker doesn’t know, based on English input.

The idea is that the speaker delivers a speech in one location in their own language, and the hologram delivers the same speech in a different language in a different location.

However impressive the demo, it remains unclear whether the raw machine translation output from Microsoft Translator was edited by a human before being used in the presentation, or how stable the hologram is in remote locations.


Google, on the other hand, has been working on breaking the traditional model in which speech-to-speech translation is split into three separate components: automatic speech recognition, which transcribes the source speech as text; machine translation of the transcribed text; and finally text-to-speech synthesis, which produces spoken output in the target language.
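The traditional three-stage cascade just described can be sketched in a few lines. The three stage functions passed in are hypothetical placeholders standing in for real ASR, MT, and TTS systems; the point is simply that each stage consumes the previous stage’s output.

```python
# Minimal sketch of the traditional cascade: speech recognition, then
# machine translation, then speech synthesis. The stage functions are
# hypothetical placeholders, not a real API.

def cascade_translate(source_audio, asr, mt, tts):
    """Source speech -> source text -> target text -> target speech."""
    transcript = asr(source_audio)   # 1. automatic speech recognition
    translation = mt(transcript)     # 2. machine translation of the text
    return tts(translation)          # 3. text-to-speech synthesis


# Dummy stages to show the data flow end to end.
result = cascade_translate(
    "clip.wav",
    asr=lambda audio: "hello world",
    mt=lambda text: "hallo welt",
    tts=lambda text: f"speech({text})",
)
print(result)
# → speech(hallo welt)
```

Each stage introduces its own latency and errors, and mistakes compound down the chain, which is precisely what a direct approach like Translatotron tries to avoid.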

In May 2019, Google announced Translatotron, a new system that performs direct speech-to-speech translation from one language to another without the intermediate steps. According to Google, it’s the world’s first system capable of doing that.

What’s more, the system allegedly retains the source speaker’s voice in the translated speech as well.

Although considerable resources and time are being invested in developing more advanced speech-to-speech translation solutions, they cannot be called simultaneous just yet, and they are certainly not close to matching the excellence of a human interpreter.
