Deep learning for speech synthesis

Aäron van den Oord


With the advent of deep learning, generative modeling has improved dramatically, almost to the point where generated samples cannot be distinguished from real data. WaveNet has shown that it is possible to model high-dimensional audio so well that it can be used for speech synthesis, outperforming the best-known methods such as concatenative and vocoder-based systems. The main advantage of generative TTS, however, may be the flexibility of these learning-based approaches. The same system that learns to speak English fluently can also be trained on other languages, such as Mandarin, or even synthesize non-voice audio such as music. A single model can learn several speaker voices at once and can switch between them by conditioning on the speaker identity. It can also adapt quickly to new, unseen data, learning new speakers from only a few sentences. Finally, generative TTS systems open the door to a wide variety of new applications, such as unsupervised phonetic unit discovery and speech compression.
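The speaker-switching idea mentioned above can be illustrated with a toy sketch of global conditioning: a single autoregressive model changes its output distribution by adding a learned per-speaker bias to its hidden activations, analogous to how WaveNet conditions on a speaker embedding. All sizes, weights, and function names here are illustrative assumptions, not details from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes, not taken from the paper.
N_SPEAKERS, HIDDEN, QUANT = 4, 16, 8

W_in = rng.normal(size=(HIDDEN,))                    # previous-sample projection
speaker_emb = rng.normal(size=(N_SPEAKERS, HIDDEN))  # learned per-speaker bias
W_out = rng.normal(size=(HIDDEN, QUANT))             # logits over quantized levels

def next_sample_logits(prev_sample: float, speaker_id: int) -> np.ndarray:
    """One autoregressive step: the speaker embedding shifts the hidden state,
    so the same input sample yields a different output distribution per voice."""
    h = np.tanh(prev_sample * W_in + speaker_emb[speaker_id])
    return h @ W_out

# Same previous sample, two different speaker identities.
logits_a = next_sample_logits(0.1, speaker_id=0)
logits_b = next_sample_logits(0.1, speaker_id=1)
print(logits_a.shape)                    # (8,)
print(np.allclose(logits_a, logits_b))   # conditioning changes the output
```

In a real WaveNet-style model the conditioning bias is broadcast into every dilated-convolution layer rather than a single hidden layer, but the principle, one shared network plus a per-speaker embedding, is the same.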


Cite as: van den Oord, A. (2019) Deep learning for speech synthesis. Proc. 10th ISCA Speech Synthesis Workshop.


@inproceedings{Oord2019,
  author={Aäron van den Oord},
  title={{Deep learning for speech synthesis}},
  year=2019,
  booktitle={Proc. 10th ISCA Speech Synthesis Workshop}
}