We propose an approach to render speech in different languages from a speaker's monolingual recordings for building mixed-language TTS systems. The differences between two monolingual speakers' corpora, e.g. English and Chinese, are first equalized by warping spectral frequency, removing F0 variation and adjusting speaking rate across speakers and languages. The English speaker's Chinese speech is then rendered by a trajectory tiling approach: the Chinese speaker's parameter trajectories, equalized toward the English speaker, are used to guide the search for the best sequence of 5 ms waveform "tiles" in the English speaker's recordings. The rendered Chinese speech, together with the English speaker's own English recordings, is finally used to train a mixed-language (English-Chinese) HMM-based TTS system. Experimental results show that the proposed approach can synthesize high-quality mixed-language speech, as confirmed by both objective and subjective evaluations.
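The core of trajectory tiling can be illustrated with a toy sketch: a guide parameter trajectory (here, the equalized trajectory from the other speaker) selects, frame by frame, the closest-matching 5 ms waveform tile from the target speaker's inventory. This is a deliberately simplified illustration assuming plain Euclidean distance and an independent greedy choice per frame; the function name and feature layout are hypothetical, and the actual search criteria and concatenation costs are not specified in this abstract.

```python
import numpy as np

def tile_trajectory(guide, tiles):
    """Toy trajectory tiling: for each guide frame, return the index of
    the candidate tile whose feature vector is nearest in Euclidean
    distance. (Illustrative only; the paper's cost is not given here.)

    guide: (T, D) array, guide feature trajectory (e.g., equalized spectra)
    tiles: (N, D) array, features of the candidate 5 ms waveform tiles
    returns: length-T array of indices into the tile inventory
    """
    # Pairwise distances via broadcasting: (T, 1, D) - (1, N, D) -> (T, N)
    dists = np.linalg.norm(guide[:, None, :] - tiles[None, :, :], axis=-1)
    # Greedy per-frame selection of the nearest tile
    return dists.argmin(axis=1)
```

The selected index sequence would then be mapped back to the corresponding waveform segments and concatenated; a practical system would also penalize discontinuities between adjacent tiles rather than choosing each frame independently.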
Index Terms: Mixed-language TTS, HMM-based TTS, Unit Selection, Trajectory Tiling
Bibliographic reference. He, Ji / Qian, Yao / Soong, Frank K. / Zhao, Sheng (2012): "Turning a monolingual speaker into multilingual for a mixed-language TTS", In INTERSPEECH-2012, 963-966.