Speech Synthesis in the Wild

Ganesh Sivaraman, Parav Nagarsheth, Elie Khoury


Speech synthesis has a wide range of applications in modern artificial intelligence technologies. Most state-of-the-art speech synthesis systems require large amounts of high-quality recorded speech from the target speaker. We focus on low-budget speech synthesis. Our software performs statistical parametric speech synthesis using unlabeled, mixed-quality speech data sourced from the internet. An average voice model trained using a DNN is adapted to a target speaker using different speaker adaptation strategies. Preprocessing methods such as speech enhancement, diarization, and segmentation are applied to the sourced data. Utterance selection based on mean cepstral distortion and forced-alignment confidence is then applied to prune noisy and misaligned data. The mixed-quality data thus preprocessed is used to adapt the average voice model and duration models to the target speaker. The software to be demonstrated automates the whole procedure from preprocessing to synthesis. It will be demonstrated by performing live synthesis using audio sourced from YouTube.
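
The utterance-selection step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `Utterance`, `cepstral_distortion`, and `select_utterances`, and both threshold values, are hypothetical. The abstract does not define its distortion measure, so the sketch assumes the commonly used mel-cepstral distortion formula as a stand-in and assumes that the compared sequences are already time-aligned.

```python
import math
from typing import List, NamedTuple, Sequence


class Utterance(NamedTuple):
    """Hypothetical record for one candidate utterance."""
    name: str
    distortion_db: float      # cepstral distortion score, lower is better
    align_confidence: float   # forced-alignment confidence in [0, 1], higher is better


def cepstral_distortion(ref_frames: Sequence[Sequence[float]],
                        test_frames: Sequence[Sequence[float]]) -> float:
    """Average per-frame cepstral distortion in dB.

    Assumes both sequences are time-aligned, have the same number of frames,
    and hold the same cepstral coefficients per frame (assumption: the
    energy coefficient has already been dropped by the caller).
    """
    if not ref_frames or len(ref_frames) != len(test_frames):
        raise ValueError("sequences must be non-empty and of equal length")
    scale = (10.0 / math.log(10.0)) * math.sqrt(2.0)
    total = 0.0
    for ref, test in zip(ref_frames, test_frames):
        total += math.sqrt(sum((a - b) ** 2 for a, b in zip(ref, test)))
    return scale * total / len(ref_frames)


def select_utterances(utterances: Sequence[Utterance],
                      max_distortion_db: float = 8.0,   # hypothetical threshold
                      min_confidence: float = 0.5       # hypothetical threshold
                      ) -> List[Utterance]:
    """Keep utterances that pass both the distortion and the confidence check."""
    return [
        u for u in utterances
        if u.distortion_db <= max_distortion_db
        and u.align_confidence >= min_confidence
    ]
```

In practice the thresholds would be tuned on the sourced data; the defaults here are placeholders only.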


Cite as: Sivaraman, G., Nagarsheth, P., Khoury, E. (2018) Speech Synthesis in the Wild. Proc. Interspeech 2018, 3217-3218.


@inproceedings{Sivaraman2018,
  author={Ganesh Sivaraman and Parav Nagarsheth and Elie Khoury},
  title={Speech Synthesis in the Wild},
  year=2018,
  booktitle={Proc. Interspeech 2018},
  pages={3217--3218}
}