Previous work on cross-lingual transfer learning in text-to-speech has shown the effectiveness of fine-tuning phonemic representations on small amounts of target language data. In other contexts, phonological features (PFs) have been suggested as a more suitable input representation than phonemes for sharing acoustic information between languages, for example in multilingual model training or for code-switching synthesis, where an utterance may contain words from multiple languages. Starting from a model trained on 14 hours of English, we find that cross-lingual fine-tuning with 15 minutes of German data can produce speech with subjective naturalness ratings comparable to a model trained from scratch on 4 hours of German, using either phonemes or PFs. We also find a modest but statistically significant improvement in naturalness ratings using PFs over phonemes when training from scratch on 4 hours of German.
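To illustrate why PFs may share information across languages more readily than phoneme identities, the following minimal sketch compares the two input representations for a German vowel that is absent from English. The three-feature inventory, the table `PF_TABLE`, and the helper names are hypothetical and chosen for illustration only; they are not the feature set or code used in the paper.

```python
# Toy feature inventory (an illustrative assumption, not the paper's feature set).
FEATURES = ("high", "back", "round")

# Binary phonological feature values for three vowels.
PF_TABLE = {
    "i": (1, 0, 0),  # high front unrounded, as in English "see"
    "u": (1, 1, 1),  # high back rounded, as in English "too"
    "y": (1, 0, 1),  # high front rounded, as in German "über"
}

# Hypothetical source-language (English) vowel inventory for this toy example.
ENGLISH = ["i", "u"]


def one_hot(phoneme, inventory):
    """Phoneme-identity input: a symbol outside the inventory has no encoding."""
    if phoneme not in inventory:
        return None
    return tuple(int(p == phoneme) for p in inventory)


def pf_vector(phoneme):
    """PF input: every phoneme is a vector over the same shared features."""
    return PF_TABLE[phoneme]


def feature_values_seen(phoneme, source_inventory):
    """True if each feature value of `phoneme` occurs in some source phoneme."""
    target = pf_vector(phoneme)
    return all(
        any(pf_vector(src)[k] == target[k] for src in source_inventory)
        for k in range(len(FEATURES))
    )


print(one_hot("y", ENGLISH))              # no one-hot slot for the unseen vowel
print(pf_vector("y"))                     # but it has a PF vector
print(feature_values_seen("y", ENGLISH))  # built from feature values seen in English
```

In this toy setting, German /y/ has no slot in an English phoneme inventory, whereas its PF vector combines feature values already covered by English /i/ and /u/. A model that takes PFs as input could in principle reuse what it learned about those features, although how much this helps in practice is an empirical question.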
Cite as: Wells, D., Richmond, K. (2021) Cross-lingual Transfer of Phonological Features for Low-resource Speech Synthesis. Proc. 11th ISCA Speech Synthesis Workshop (SSW 11), 160-165, doi: 10.21437/SSW.2021-28
@inproceedings{wells21_ssw,
  author={Dan Wells and Korin Richmond},
  title={{Cross-lingual Transfer of Phonological Features for Low-resource Speech Synthesis}},
  year=2021,
  booktitle={Proc. 11th ISCA Speech Synthesis Workshop (SSW 11)},
  pages={160--165},
  doi={10.21437/SSW.2021-28}
}