Google’s Text-to-Speech is Indistinguishable from Human Voice

Tacotron 2 is a text-to-speech system that pairs a neural network that predicts spectrograms from text with a modified WaveNet vocoder that turns those spectrograms into audio. The system is trained directly from data without relying on complex feature engineering, and it achieves state-of-the-art sound quality close to that of natural human speech.

"In an evaluation where we asked human listeners to rate the naturalness of the generated speech, we obtained a score that was comparable to that of professional recordings," claim the researchers.

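Naturalness ratings like the ones the researchers describe are commonly summarized as a mean opinion score: the average of listener ratings on a 1-to-5 scale. The article does not give the exact procedure, so the following is only a minimal sketch of how such a score could be computed. The function name, the 1-to-5 scale, the normal-approximation confidence interval, and the ratings are all assumptions made for illustration, not data from the study.

```python
from math import sqrt
from statistics import mean, stdev


def mean_opinion_score(ratings):
    """Return (mean rating, 95% confidence half-width) for 1-5 ratings.

    Hypothetical helper: the half-width uses a normal approximation
    (1.96 * standard error), which is an assumption of this sketch.
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must be between 1 and 5")
    score = mean(ratings)
    half_width = 1.96 * stdev(ratings) / sqrt(len(ratings))
    return score, half_width


# Made-up ratings for illustration only; not results from the study.
synthesized_ratings = [5, 4, 5, 4, 4, 5, 5, 4]
recorded_ratings = [5, 5, 4, 5, 4, 5, 5, 4]

synth_score, synth_ci = mean_opinion_score(synthesized_ratings)
rec_score, rec_ci = mean_opinion_score(recorded_ratings)
print(f"synthesized: {synth_score:.2f} +/- {synth_ci:.2f}")
print(f"recorded:    {rec_score:.2f} +/- {rec_ci:.2f}")
```

If the two confidence intervals overlap substantially, listeners did not clearly prefer one source over the other, which is the sense in which two scores can be called comparable.
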
The system still has difficulty pronouncing complex words, and in extreme cases it can randomly generate strange noises. Another limitation is that Tacotron 2 cannot yet generate audio in real time.

In the future the team hopes to add emotional tones to the speech. "We cannot yet control the generated speech, such as directing it to sound happy or sad. Each of these is an interesting research problem on its own," they state.