Personalized, Cross-Lingual TTS Using Phonetic Posteriorgrams

Lifa Sun, Hao Wang, Shiyin Kang, Kun Li, Helen Meng

We present a novel approach that enables a target speaker (e.g. a monolingual Chinese speaker) to speak a new language (e.g. English) based on arbitrary textual input. Our system includes an English speaker-independent automatic speech recognition (SI-ASR) engine trained on TIMIT. Given the target speaker’s speech in a non-target language, we generate Phonetic PosteriorGrams (PPGs) with the SI-ASR and then train a Deep Bidirectional Long Short-Term Memory based Recurrent Neural Network (DBLSTM) to model the relationship between the PPGs and the acoustic signal. Synthesis involves feeding arbitrary text to a general TTS engine (trained on any non-target speaker), whose output is converted by the SI-ASR into PPGs. The DBLSTM then uses these PPGs to synthesize the target language in the target speaker’s voice. A main advantage of this approach is its very low training-data requirement for the target speaker, whose recordings can be in any language. By comparison, a reference approach trains a dedicated TTS engine on many recordings from the target speaker, all in the target language. For a given target speaker, our proposed approach trained on 100 Mandarin (i.e. non-target language) utterances achieves English synthetic speech of comparable quality (in MOS and ABX tests) to an HTS system trained on 1,000 English utterances.
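The data flow described above can be illustrated with a toy sketch. It is not the authors' implementation. The SI-ASR is replaced by a fixed random linear layer with a softmax, and the DBLSTM is replaced by a posterior-weighted codebook of mean acoustic vectors. All names, dimensions, and the random "speech" frames are hypothetical, chosen only to show how PPGs link a non-target speaker's TTS output to the target speaker's acoustic features.

```python
import math
import random

random.seed(0)
N_CLASSES, N_INPUT, N_ACOUSTIC = 4, 6, 3  # hypothetical sizes, not from the paper

# Stand-in for the trained SI-ASR: a fixed random linear layer followed by softmax.
W_ASR = [[random.gauss(0, 1) for _ in range(N_CLASSES)] for _ in range(N_INPUT)]


def extract_ppg(frames):
    """Return one posterior distribution over phonetic classes per frame."""
    ppg = []
    for frame in frames:
        logits = [sum(x * W_ASR[i][k] for i, x in enumerate(frame))
                  for k in range(N_CLASSES)]
        peak = max(logits)  # subtract the maximum for numerical stability
        exps = [math.exp(v - peak) for v in logits]
        total = sum(exps)
        ppg.append([e / total for e in exps])
    return ppg


def fit_mapping(ppg, acoustic):
    """Stand-in for DBLSTM training: posterior-weighted mean acoustic vector per class."""
    sums = [[0.0] * N_ACOUSTIC for _ in range(N_CLASSES)]
    mass = [0.0] * N_CLASSES
    for post, feat in zip(ppg, acoustic):
        for k, p in enumerate(post):
            mass[k] += p
            for d in range(N_ACOUSTIC):
                sums[k][d] += p * feat[d]
    return [[s / max(m, 1e-12) for s in row] for row, m in zip(sums, mass)]


def apply_mapping(ppg, codebook):
    """Predict acoustic features as the posterior-weighted mix of class means."""
    return [[sum(p * codebook[k][d] for k, p in enumerate(post))
             for d in range(N_ACOUSTIC)]
            for post in ppg]


def random_frames(n, dim):
    """Placeholder for real feature extraction from recorded or synthesized audio."""
    return [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n)]


# Training: the target speaker's speech (any language) yields PPG/acoustic pairs.
target_frames = random_frames(200, N_INPUT)
target_acoustic = random_frames(200, N_ACOUSTIC)
codebook = fit_mapping(extract_ppg(target_frames), target_acoustic)

# Synthesis: a general TTS engine's output (another speaker) is turned into PPGs,
# which the mapping converts into acoustic features in the target speaker's space.
tts_ppg = extract_ppg(random_frames(50, N_INPUT))
predicted_acoustic = apply_mapping(tts_ppg, codebook)
```

The point of the sketch is the interface: because PPGs are per-frame posterior distributions over phonetic classes, the same representation can be computed from any speaker's audio, which is what lets the mapping be trained on one speaker and driven by another.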

DOI: 10.21437/Interspeech.2016-1043

Cite as

Sun, L., Wang, H., Kang, S., Li, K., Meng, H. (2016) Personalized, Cross-Lingual TTS Using Phonetic Posteriorgrams. Proc. Interspeech 2016, 322-326.

author={Lifa Sun and Hao Wang and Shiyin Kang and Kun Li and Helen Meng},
title={Personalized, Cross-Lingual TTS Using Phonetic Posteriorgrams},
booktitle={Interspeech 2016},