ISCA Archive Interspeech 2021

LiteTTS: A Lightweight Mel-Spectrogram-Free Text-to-Wave Synthesizer Based on Generative Adversarial Networks

Huu-Kim Nguyen, Kihyuk Jeong, Seyun Um, Min-Jae Hwang, Eunwoo Song, Hong-Goo Kang

In this paper, we propose a lightweight end-to-end text-to-speech model that can generate high-quality speech at breakneck speed. In our proposed model, a feature prediction module and a waveform generation module are combined within a single framework. The feature prediction module, which consists of two independent sub-modules, estimates latent space embeddings for input text and prosodic information, and the waveform generation module generates speech waveforms by conditioning on the estimated latent space embeddings. Unlike conventional approaches that estimate prosodic information using a pre-trained model, our model jointly trains the prosodic embedding network with the speech waveform generation task using an effective domain transfer technique. Experimental results show that our proposed model can generate samples 7 times faster than real-time, and about 1.6 times faster than FastSpeech 2, as we use only 13.4 million parameters. We confirm that the generated speech quality is still of a high standard as evaluated by mean opinion scores.
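The speed figures in the abstract can be read in terms of the real-time factor, a measure commonly used for synthesis speed. The sketch below is a minimal illustration, not the authors' evaluation code: it assumes that "N times faster than real-time" means the duration of the generated audio divided by the time taken to synthesize it. The function names and the timing values in the usage lines are hypothetical.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Compute seconds of processing per second of generated audio.

    Values below 1.0 mean synthesis runs faster than real-time.
    """
    if synthesis_seconds <= 0 or audio_seconds <= 0:
        raise ValueError("both durations must be positive")
    return synthesis_seconds / audio_seconds


def speedup_over_real_time(synthesis_seconds: float, audio_seconds: float) -> float:
    """Return how many times faster than real-time the synthesis ran."""
    if synthesis_seconds <= 0 or audio_seconds <= 0:
        raise ValueError("both durations must be positive")
    return audio_seconds / synthesis_seconds


# Hypothetical timing, chosen only to match the "7 times faster" wording:
# 7 seconds of audio produced in 1 second of compute.
rtf = real_time_factor(synthesis_seconds=1.0, audio_seconds=7.0)
speedup = speedup_over_real_time(synthesis_seconds=1.0, audio_seconds=7.0)
print(f"RTF = {rtf:.3f}, speedup = {speedup:.1f}x")
```

Under this reading, a model that is 7 times faster than real-time has a real-time factor of about 0.143, and the reported 1.6 times advantage over FastSpeech 2 would be the ratio of the two models' synthesis times for the same utterance.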


doi: 10.21437/Interspeech.2021-188

Cite as: Nguyen, H.-K., Jeong, K., Um, S., Hwang, M.-J., Song, E., Kang, H.-G. (2021) LiteTTS: A Lightweight Mel-Spectrogram-Free Text-to-Wave Synthesizer Based on Generative Adversarial Networks. Proc. Interspeech 2021, 3595-3599, doi: 10.21437/Interspeech.2021-188

@inproceedings{nguyen21e_interspeech,
  author={Huu-Kim Nguyen and Kihyuk Jeong and Seyun Um and Min-Jae Hwang and Eunwoo Song and Hong-Goo Kang},
  title={{LiteTTS: A Lightweight Mel-Spectrogram-Free Text-to-Wave Synthesizer Based on Generative Adversarial Networks}},
  year=2021,
  booktitle={Proc. Interspeech 2021},
  pages={3595--3599},
  doi={10.21437/Interspeech.2021-188}
}