ISCA Archive Interspeech 2021

Diff-TTS: A Denoising Diffusion Model for Text-to-Speech

Myeonghun Jeong, Hyeongju Kim, Sung Jun Cheon, Byoung Jin Choi, Nam Soo Kim

Although neural text-to-speech (TTS) models have attracted a lot of attention and succeeded in generating human-like speech, there is still room for improvement in naturalness and architectural efficiency. In this work, we propose a novel non-autoregressive TTS model, namely Diff-TTS, which achieves highly natural and efficient speech synthesis. Given the text, Diff-TTS exploits a denoising diffusion framework to transform a noise signal into a mel-spectrogram through a sequence of diffusion time steps. To learn the mel-spectrogram distribution conditioned on the text, we present a likelihood-based optimization method for TTS. Furthermore, to boost the inference speed, we leverage an accelerated sampling method that allows Diff-TTS to generate raw waveforms much faster without significantly degrading perceptual quality. Through experiments, we verified that Diff-TTS synthesizes speech 28 times faster than real-time on a single NVIDIA 2080Ti GPU.
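The abstract compresses the method into two ideas: a reverse diffusion process that iteratively denoises Gaussian noise into a mel-spectrogram conditioned on the text, and an accelerated sampler that traverses only a subsequence of the diffusion time steps. The sketch below is a rough NumPy illustration of that sampling loop, not the authors' released code: the linear noise schedule, the denoiser stand-in, and the DDIM-style deterministic update on a strided schedule are all assumptions made for the example.

import numpy as np

# Hypothetical stand-in for the Diff-TTS denoiser network, which predicts
# the injected noise from the noisy mel-spectrogram x_t, the time step t,
# and the text (phoneme) encoding.
def denoiser(x_t, t, text_encoding):
    return np.zeros_like(x_t)  # placeholder; a trained network goes here

def sample_mel(text_encoding, mel_shape=(80, 200), num_steps=400, stride=1):
    """Reverse diffusion from Gaussian noise to a mel-spectrogram.

    stride=1 visits every time step; stride>1 visits a subsequence of
    steps, illustrating accelerated sampling, which trades a few
    denoising iterations for speed. The deterministic (DDIM-style)
    update below remains valid on a strided schedule.
    """
    betas = np.linspace(1e-4, 0.02, num_steps)       # assumed linear schedule
    alpha_bars = np.cumprod(1.0 - betas)             # cumulative signal level

    steps = list(range(num_steps - 1, -1, -stride))  # e.g. 399, 391, ...
    x = np.random.randn(*mel_shape)                  # start from pure noise
    for i, t in enumerate(steps):
        eps_hat = denoiser(x, t, text_encoding)      # predicted noise
        # Estimate the clean mel-spectrogram implied by x_t and eps_hat.
        x0_hat = (x - np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alpha_bars[t])
        if i + 1 < len(steps):
            s = steps[i + 1]                         # next (earlier) step
            x = np.sqrt(alpha_bars[s]) * x0_hat + np.sqrt(1.0 - alpha_bars[s]) * eps_hat
        else:
            x = x0_hat                               # final denoised output
    return x

mel = sample_mel(text_encoding=None)                 # full schedule
mel_fast = sample_mel(text_encoding=None, stride=8)  # accelerated sampling

With a trained denoiser, increasing the stride is what yields the reported speedup: fewer network evaluations per utterance at a modest cost in perceptual quality.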


doi: 10.21437/Interspeech.2021-469

Cite as: Jeong, M., Kim, H., Cheon, S.J., Choi, B.J., Kim, N.S. (2021) Diff-TTS: A Denoising Diffusion Model for Text-to-Speech. Proc. Interspeech 2021, 3605-3609, doi: 10.21437/Interspeech.2021-469

@inproceedings{jeong21_interspeech,
  author={Myeonghun Jeong and Hyeongju Kim and Sung Jun Cheon and Byoung Jin Choi and Nam Soo Kim},
  title={{Diff-TTS: A Denoising Diffusion Model for Text-to-Speech}},
  year={2021},
  booktitle={Proc. Interspeech 2021},
  pages={3605--3609},
  doi={10.21437/Interspeech.2021-469}
}