Selection and Training Schemes for Improving TTS Voice Built on Found Data

F.-Y. Kuo, I.C. Ouyang, S. Aryal, Pierre Lanchantin


This work investigates different selection and training schemes to improve the naturalness of synthesized text-to-speech voices built on found data. The approach examines combinations of different metrics to detect and reject segments of training data that can degrade the performance of the system. We conducted a series of objective and subjective experiments on two 24-hour single-speaker corpora of found data collected from diverse sources. We show that a smaller, yet carefully selected, set of data can lead to a text-to-speech system that generates more natural speech than a system trained on the complete dataset. Moreover, we show that fine-tuning from the system trained on the whole dataset yields an additional improvement in naturalness by allowing a more aggressive selection of training data.


DOI: 10.21437/Interspeech.2019-2816

Cite as: Kuo, F., Ouyang, I., Aryal, S., Lanchantin, P. (2019) Selection and Training Schemes for Improving TTS Voice Built on Found Data. Proc. Interspeech 2019, 1516-1520, DOI: 10.21437/Interspeech.2019-2816.


@inproceedings{Kuo2019,
  author={F.-Y. Kuo and I.C. Ouyang and S. Aryal and Pierre Lanchantin},
  title={{Selection and Training Schemes for Improving TTS Voice Built on Found Data}},
  year=2019,
  booktitle={Proc. Interspeech 2019},
  pages={1516--1520},
  doi={10.21437/Interspeech.2019-2816},
  url={http://dx.doi.org/10.21437/Interspeech.2019-2816}
}