Eighth ISCA Workshop on Speech Synthesis

Barcelona, Catalonia, Spain
August 31-September 2, 2013

Statistical Model Training Technique for Speech Synthesis Based on Speaker Class

Yusuke Ijima, Noboru Miyazaki, Hideyuki Mizuno

NTT Corporation, Japan

To allow the average-voice-based speech synthesis technique to generate synthetic speech that is more similar to that of the target speaker, we propose a model training technique that introduces speaker-class labels. A speaker class represents the voice characteristics of a group of speakers. In the proposed technique, all training data are first clustered to determine the speaker classes. The average voice model is then trained using both conventional context labels and speaker-class labels. In the speaker adaptation process, the target speaker’s class is estimated and used to transform the average voice model into the target speaker’s model. The target speaker’s speech is then synthesized from the target speaker’s model and the estimated speaker class. The results of an objective experiment show that the proposed technique significantly reduces the RMS error of log F0. Moreover, the results of a subjective experiment indicate that the proposed technique yields synthesized speech with better similarity to the target speaker than the conventional method.

Index Terms: HMM-based speech synthesis, average voice model, speaker adaptation, speaker clustering
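The clustering, class-estimation, and evaluation steps described in the abstract can be illustrated with a minimal sketch. The abstract does not state which features, clustering algorithm, or number of classes the authors use, so everything below is an assumption made for illustration: each speaker is summarized by a small hypothetical feature vector, speakers are grouped with plain k-means, the target speaker's class is taken to be the nearest centroid, and the RMS error of log F0 is computed over frames that are voiced in both contours. The HMM training and adaptation stages themselves are not shown.

```python
import math


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_class(centroids, vec):
    """Index of the centroid closest to vec (assumed class-estimation rule)."""
    return min(range(len(centroids)),
               key=lambda k: squared_distance(centroids[k], vec))


def cluster_speakers(speaker_vecs, num_classes, iterations=20):
    """Group speakers with plain k-means (an assumption, not the paper's method).

    speaker_vecs maps a speaker name to a hypothetical per-speaker feature
    vector. Centroids are initialised from the first num_classes speakers in
    sorted name order so that the result is deterministic.
    """
    names = sorted(speaker_vecs)
    centroids = [list(speaker_vecs[n]) for n in names[:num_classes]]
    for _ in range(iterations):
        labels = {n: nearest_class(centroids, speaker_vecs[n]) for n in names}
        for k in range(num_classes):
            members = [speaker_vecs[n] for n in names if labels[n] == k]
            if members:  # keep the old centroid if a class becomes empty
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    labels = {n: nearest_class(centroids, speaker_vecs[n]) for n in names}
    return labels, centroids


def rmse_log_f0(ref_f0, syn_f0):
    """RMS error of log F0 over frames voiced (F0 > 0) in both contours."""
    pairs = [(r, s) for r, s in zip(ref_f0, syn_f0) if r > 0 and s > 0]
    if not pairs:
        raise ValueError("no frames are voiced in both contours")
    total = sum((math.log(r) - math.log(s)) ** 2 for r, s in pairs)
    return math.sqrt(total / len(pairs))


if __name__ == "__main__":
    # Hypothetical training speakers: (mean log F0, speaking rate).
    training = {
        "spk_a": (4.7, 5.0),
        "spk_b": (4.8, 5.2),
        "spk_c": (5.3, 6.0),
        "spk_d": (5.4, 6.1),
    }
    labels, centroids = cluster_speakers(training, num_classes=2)
    target_class = nearest_class(centroids, (5.25, 5.9))
    print("speaker classes:", labels)
    print("estimated class of target speaker:", target_class)
```

In a full system, the class label assigned to each training speaker would be attached to that speaker's context labels before the average voice model is trained, and the estimated class of the target speaker would be supplied at adaptation and synthesis time.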

Bibliographic reference. Ijima, Yusuke / Miyazaki, Noboru / Mizuno, Hideyuki (2013): "Statistical model training technique for speech synthesis based on speaker class", in SSW8, 141-145.