10th Annual Conference of the International Speech Communication Association

Brighton, United Kingdom
September 6-10, 2009

Thousands of Voices for HMM-Based Speech Synthesis

Junichi Yamagishi (1), Bela Usabaev (2), Simon King (1), Oliver Watts (1), John Dines (3), Jilei Tian (4), Rile Hu (4), Yong Guan (4), Keiichiro Oura (5), Keiichi Tokuda (5), Reima Karhila (6), Mikko Kurimo (6)

(1) University of Edinburgh, UK
(2) Universität Tübingen, Germany
(3) IDIAP Research Institute, Switzerland
(4) Nokia Research Center, China
(5) Nagoya Institute of Technology, Japan
(6) Helsinki University of Technology, Finland

Our recent experiments with HMM-based speech synthesis systems have demonstrated that speaker-adaptive HMM-based speech synthesis (which uses an ‘average voice model’ plus model adaptation) is robust to non-ideal speech data that are recorded under various conditions and with varying microphones, that are not perfectly clean, and/or that lack phonetic balance. This enables us to consider building high-quality voices on ‘non-TTS’ corpora such as ASR corpora. Since ASR corpora generally include a large number of speakers, this leads to the possibility of producing an enormous number of voices automatically. In this paper we show thousands of voices for HMM-based speech synthesis that we have made from several popular ASR corpora such as the Wall Street Journal databases (WSJ0/WSJ1/WSJCAM0), Resource Management, Globalphone and Speecon. We report some perceptual evaluation results and outline the outstanding issues.

Bibliographic reference. Yamagishi, Junichi / Usabaev, Bela / King, Simon / Watts, Oliver / Dines, John / Tian, Jilei / Hu, Rile / Guan, Yong / Oura, Keiichiro / Tokuda, Keiichi / Karhila, Reima / Kurimo, Mikko (2009): "Thousands of voices for HMM-based speech synthesis", In INTERSPEECH-2009, 420-423.