A generative adversarial network (GAN) is a versatile AI architecture that's exceptionally well-suited to synthesizing images, videos, and text from limited data. But it hasn't seen much use in the audio domain owing to a number of design challenges, which is why researchers at Google and Imperial College London set out to create a GAN-based text-to-speech (TTS) system capable of matching (or besting) state-of-the-art methods.
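For readers unfamiliar with the setup, a GAN pits a generator network against a discriminator that tries to tell generated samples from real ones, and the two are trained adversarially. The sketch below is a minimal, illustrative PyTorch version of that training loop; the layer sizes, random stand-in data, and step count are assumptions for illustration, not the researchers' actual model.

```python
import torch
import torch.nn as nn

# Toy generator: maps random noise to a fixed-length "sample"
generator = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 128))
# Toy discriminator: scores how "real" a sample looks
discriminator = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 1))

g_opt = torch.optim.Adam(generator.parameters(), lr=2e-4)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

real_batch = torch.randn(32, 128)  # stand-in for real training data

for step in range(100):
    # Discriminator step: push real samples toward 1, generated samples toward 0
    fake = generator(torch.randn(32, 16)).detach()
    d_loss = (bce(discriminator(real_batch), torch.ones(32, 1))
              + bce(discriminator(fake), torch.zeros(32, 1)))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # Generator step: try to make the discriminator score fakes as real
    fake = generator(torch.randn(32, 16))
    g_loss = bce(discriminator(fake), torch.ones(32, 1))
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
```

As the generator improves, the discriminator's job gets harder, and the two push each other toward producing samples that resemble the training data.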
They say that their model not only generates high-fidelity speech with “naturalness” but is also highly parallelizable, meaning it can produce audio far more efficiently than conventional alternatives, which generate waveforms one sample at a time.
“A notable limitation of [state-of-the-art TTS] models is that they are difficult to parallelize over time: they predict each time step of an audio signal in sequence, which is computationally expensive and often impractical,” wrote the coauthors.
“A lot of recent research on neural models for TTS has focused on improving parallelism by predicting multiple time steps in parallel. An alternative approach for parallel waveform generation would be to use generative adversarial networks … To the best of our knowledge, GANs have not yet been applied at large scale to non-visual domains.”
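The contrast the researchers draw is between autoregressive sampling, where each audio sample depends on the ones before it and must be produced sequentially, and a feedforward GAN generator, which can emit an entire waveform in a single pass. The sketch below illustrates that difference; the model shapes, waveform length, and function names are hypothetical, not the paper's architecture.

```python
import torch
import torch.nn as nn

WAVEFORM_LEN = 2_000  # illustrative number of audio samples

def autoregressive_generate(step_model: nn.Module) -> torch.Tensor:
    """Sequential sampling: each sample depends on the previous one,
    so the loop cannot be parallelized over time."""
    waveform = torch.zeros(1, 1)
    for _ in range(WAVEFORM_LEN):              # O(WAVEFORM_LEN) dependent steps
        nxt = step_model(waveform[:, -1:])     # predict the next sample from the last
        waveform = torch.cat([waveform, nxt], dim=1)
    return waveform[:, 1:]

def gan_generate(generator: nn.Module) -> torch.Tensor:
    """Feedforward generation: one forward pass emits every sample at once,
    which parallelizes trivially on modern accelerators."""
    noise = torch.randn(1, 128)                # latent noise (a real TTS model is also conditioned on text)
    return generator(noise)                    # shape (1, WAVEFORM_LEN)

# Illustrative stand-in models
step_model = nn.Linear(1, 1)
generator = nn.Sequential(nn.Linear(128, 512), nn.ReLU(), nn.Linear(512, WAVEFORM_LEN))

slow = autoregressive_generate(step_model)     # thousands of sequential model calls
fast = gan_generate(generator)                 # a single forward pass
```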