Eighth ISCA Workshop on Speech Synthesis

Barcelona, Catalonia, Spain
August 31-September 2, 2013

Unsupervised and lightly-supervised learning for rapid construction of TTS systems in multiple languages from ‘found’ data: evaluation and analysis

Oliver Watts (1), Adriana Stan (2), Robert A. J. Clark (1), Yoshitaka Mamiya (1), Mircea Giurgiu (2), Junichi Yamagishi (1), Simon King (1)

(1) University of Edinburgh, UK
(2) Technical University of Cluj-Napoca, Romania

This paper presents techniques for building text-to-speech front-ends without language-specific expert knowledge, relying instead on universal resources (such as the Unicode character database) and on unsupervised learning from unannotated data. Acquiring expert language-specific knowledge and expert-annotated data is a major bottleneck in the development of corpus-based TTS systems for new languages. The methods presented here sidestep the need for resources such as pronunciation lexicons, phonetic feature sets, and part-of-speech-tagged data. The paper explains how these techniques are applied to the 14 languages of a corpus of ‘found’ audiobook data, and presents the results of an intelligibility evaluation of the resulting systems.

Index Terms: multilingual speech synthesis, unsupervised learning, vector space model, text-to-speech, audiobook data

Bibliographic reference. Watts, Oliver / Stan, Adriana / Clark, Robert A. J. / Mamiya, Yoshitaka / Giurgiu, Mircea / Yamagishi, Junichi / King, Simon (2013): "Unsupervised and lightly-supervised learning for rapid construction of TTS systems in multiple languages from ‘found’ data: evaluation and analysis", in SSW8, 101-106.