Neural Text-to-Speech (TTS) synthesis can generate high-quality speech with natural prosody. However, these systems typically require a large amount of data, preferably recorded in a clean, noise-free environment. We focus on creating target voices from low quality public recordings, and our findings show that even with a large amount of data from a specific speaker, it is challenging to train a speaker-dependent neural TTS model. To improve voice quality while simultaneously reducing the amount of data required, we introduce meta-learning to adapt the neural TTS front-end. We propose three approaches for multi-speaker systems: (1) a lookup-table-based system, (2) a speaker representation derived from the Personalized Hey Siri (PHS) system, and (3) a system with no speaker encoder. Results show that: (i) using a significantly smaller number of target voice recordings, the proposed system based on embeddings trained from the PHS system can generate quality and speaker similarity comparable to the speaker-dependent model trained solely on the target voice; (ii) applying meta-learning to Tacotron can effectively learn a representation of an unseen speaker; and (iii) for low quality public recordings, adaptation based on the multi-speaker corpus can generate a cleaner target voice than the speaker-dependent model.
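To make the conditioning idea behind approach (1) concrete, below is a minimal PyTorch sketch of a Tacotron-style encoder conditioned on a lookup-table speaker embedding. The paper does not publish code; the module names, dimensions, and the choice to concatenate the speaker vector with the encoder outputs are illustrative assumptions, not the authors' implementation.

    import torch
    import torch.nn as nn

    class SpeakerConditionedEncoder(nn.Module):
        """Hypothetical sketch: a text encoder conditioned on a
        lookup-table speaker embedding (approach (1) in the abstract)."""

        def __init__(self, num_speakers: int, spk_dim: int = 64, enc_dim: int = 512):
            super().__init__()
            # Approach (1): one trainable embedding vector per training speaker.
            self.speaker_table = nn.Embedding(num_speakers, spk_dim)
            # Stand-in for the Tacotron text encoder (e.g. conv + BiLSTM stack).
            self.text_encoder = nn.LSTM(input_size=enc_dim, hidden_size=enc_dim // 2,
                                        batch_first=True, bidirectional=True)

        def forward(self, text_embeddings: torch.Tensor, speaker_ids: torch.Tensor):
            # text_embeddings: (batch, time, enc_dim); speaker_ids: (batch,)
            enc_out, _ = self.text_encoder(text_embeddings)
            # Broadcast the speaker vector over time and concatenate, so the
            # decoder/attention sees speaker identity at every encoder step.
            spk = self.speaker_table(speaker_ids).unsqueeze(1)   # (batch, 1, spk_dim)
            spk = spk.expand(-1, enc_out.size(1), -1)            # (batch, time, spk_dim)
            return torch.cat([enc_out, spk], dim=-1)             # (batch, time, enc_dim + spk_dim)

Under the same reading, approach (2) would replace the trainable lookup table with a fixed speaker vector extracted by the PHS system, and approach (3) would drop the speaker vector entirely and rely on fine-tuning the whole model to the target speaker.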
Cite as: Hu, Q., Marchi, E., Winarsky, D., Stylianou, Y., Naik, D., Kajarekar, S. (2019) Neural Text-to-Speech Adaptation from Low Quality Public Recordings. Proc. 10th ISCA Workshop on Speech Synthesis (SSW 10), 24-28, doi: 10.21437/SSW.2019-5
@inproceedings{hu19_ssw,
  author={Qiong Hu and Erik Marchi and David Winarsky and Yannis Stylianou and Devang Naik and Sachin Kajarekar},
  title={{Neural Text-to-Speech Adaptation from Low Quality Public Recordings}},
  year=2019,
  booktitle={Proc. 10th ISCA Workshop on Speech Synthesis (SSW 10)},
  pages={24--28},
  doi={10.21437/SSW.2019-5}
}