15th Annual Conference of the International Speech Communication Association

September 14-18, 2014

Transform Mapping Using Shared Decision Tree Context Clustering for HMM-Based Cross-Lingual Speech Synthesis

Daiki Nagahama (1), Takashi Nose (2), Tomoki Koriyama (1), Takao Kobayashi (1)

(1) Tokyo Institute of Technology, Japan
(2) Tohoku University, Japan

This paper proposes a novel transform mapping technique based on shared decision tree context clustering (STC) for HMM-based cross-lingual speech synthesis. In conventional cross-lingual speaker adaptation based on state mapping, adaptation performance is not always satisfactory when the average voice models of the input and output languages are mismatched in language and speaker. The proposed technique alleviates the effect of these mismatches on the transform mapping by introducing a language-independent decision tree constructed by STC and by representing the average voice models with both language-independent and language-dependent tree structures. We also use a bilingual speech corpus to keep speaker characteristics consistent between the average voice models of the different languages. Experimental results show that, compared with state mapping, the proposed technique decreases both spectral and prosodic distortions between original and generated parameter trajectories and significantly improves the naturalness of synthetic speech while maintaining speaker similarity.

Bibliographic reference. Nagahama, Daiki / Nose, Takashi / Koriyama, Tomoki / Kobayashi, Takao (2014): "Transform mapping using shared decision tree context clustering for HMM-based cross-lingual speech synthesis", in INTERSPEECH-2014, 770-774.