Our recent experiments with HMM-based speech synthesis systems have demonstrated that speaker-adaptive HMM-based speech synthesis (which uses an ‘average voice model’ plus model adaptation) is robust to non-ideal speech data that are recorded under various conditions and with varying microphones, that are not perfectly clean, and/or that lack phonetic balance. This enables us to consider building high-quality voices from ‘non-TTS’ corpora such as ASR corpora. Since ASR corpora generally include a large number of speakers, this opens up the possibility of producing an enormous number of voices automatically. In this paper we present thousands of voices for HMM-based speech synthesis that we have made from several popular ASR corpora such as the Wall Street Journal databases (WSJ0/WSJ1/WSJCAM0), Resource Management, GlobalPhone and SPEECON. We report some perceptual evaluation results and outline the outstanding issues.
Bibliographic reference. Yamagishi, Junichi / Usabaev, Bela / King, Simon / Watts, Oliver / Dines, John / Tian, Jilei / Hu, Rile / Guan, Yong / Oura, Keiichiro / Tokuda, Keiichi / Karhila, Reima / Kurimo, Mikko (2009): "Thousands of voices for HMM-based speech synthesis", in Proc. INTERSPEECH 2009, pp. 420-423.