This paper presents techniques for building text-to-speech front-ends in a way that avoids the need for language-specific expert knowledge, relying instead on universal resources (such as the Unicode character database) and unsupervised learning from unannotated data to ease system development. The acquisition of expert language-specific knowledge and expert-annotated data is a major bottleneck in the development of corpus-based TTS systems in new languages. The methods presented here side-step the need for resources such as pronunciation lexicons, phonetic feature sets, and part-of-speech-tagged data. The paper explains how the techniques introduced are applied to the 14 languages of a corpus of ‘found’ audiobook data, and presents the results of an intelligibility evaluation of the systems built from this data with these novel techniques.
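To make the central idea concrete, the minimal sketch below (not taken from the paper; the function name and the particular feature choices are illustrative assumptions) shows how properties drawn from the Unicode character database can stand in for an expert-defined phonetic feature set when deriving letter-based features for a front-end.

```python
import unicodedata

def unicode_letter_features(ch):
    """Derive language-independent features for a character from the
    Unicode character database, instead of an expert phonetic feature set.
    (Illustrative sketch only; the paper's actual feature set may differ.)"""
    category = unicodedata.category(ch)  # e.g. 'Lu', 'Ll', 'Nd', 'Po'
    return {
        "category": category,
        "is_letter": category.startswith("L"),
        "is_digit": ch.isdigit(),
        # First tokens of the official character name, e.g. ['LATIN', 'SMALL']
        "name_tokens": unicodedata.name(ch, "UNKNOWN").split()[:2],
        # Canonical decomposition separates a base letter from its diacritics
        "decomposition": unicodedata.decomposition(ch),
    }

if __name__ == "__main__":
    for ch in "Señor":
        print(ch, unicode_letter_features(ch))
```

Features of this kind are available for any script covered by Unicode, which is what allows the front-end to be built without consulting a language expert.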
Index Terms: multilingual speech synthesis, unsupervised learning, vector space model, text-to-speech, audiobook data
Cite as: Watts, O., Stan, A., Clark, R.A.J., Mamiya, Y., Giurgiu, M., Yamagishi, J., King, S. (2013) Unsupervised and lightly-supervised learning for rapid construction of TTS systems in multiple languages from ‘found’ data: evaluation and analysis. Proc. 8th ISCA Workshop on Speech Synthesis (SSW 8), 101-106
@inproceedings{watts13_ssw,
  author={Oliver Watts and Adriana Stan and Robert A. J. Clark and Yoshitaka Mamiya and Mircea Giurgiu and Junichi Yamagishi and Simon King},
  title={{Unsupervised and lightly-supervised learning for rapid construction of TTS systems in multiple languages from ‘found’ data: evaluation and analysis}},
  year={2013},
  booktitle={Proc. 8th ISCA Workshop on Speech Synthesis (SSW 8)},
  pages={101--106}
}