Eighth ISCA Workshop on Speech Synthesis
Barcelona, Catalonia, Spain
To allow the average-voice-based speech synthesis technique to generate synthetic speech that is more similar to that of the target speaker, we propose a model training technique that introduces a speaker class label. The speaker class represents the voice characteristics of a speaker. In the proposed technique, all training data are first clustered to determine the speaker classes. The average voice model is then trained using both the conventional context labels and the speaker class labels. In the speaker adaptation process, the target speaker's class is estimated and used to transform the average voice model into the target speaker's model. The speech of the target speaker is then synthesized from the target speaker's model and the target speaker's estimated speaker class. The results of an objective experiment show that the proposed technique significantly reduces the RMS error of log F0. Moreover, the results of a subjective experiment indicate that the proposed technique yields synthesized speech with better similarity to the target speaker than the conventional method.
Index Terms: HMM-based speech synthesis, average voice model, speaker adaptation, speaker clustering
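The objective measure named in the abstract, the RMS error of log F0, can be illustrated with a minimal sketch. The abstract does not specify the log base, the units, or how unvoiced frames are handled, so the choices below are assumptions: natural logarithm, frame-aligned contours, and scoring only frames that are voiced (F0 > 0) in both contours. The function name `log_f0_rmse` is hypothetical.

```python
import math


def log_f0_rmse(reference_f0, synthesized_f0):
    """RMS error of natural-log F0 between two frame-aligned contours.

    Assumption: a frame with F0 <= 0 is unvoiced, and only frames voiced
    in BOTH contours contribute to the error.
    """
    if len(reference_f0) != len(synthesized_f0):
        raise ValueError("contours must be frame-aligned and equal in length")

    # Squared log-F0 differences over frames voiced in both contours.
    squared_errors = [
        (math.log(ref) - math.log(syn)) ** 2
        for ref, syn in zip(reference_f0, synthesized_f0)
        if ref > 0 and syn > 0
    ]
    if not squared_errors:
        raise ValueError("no frames are voiced in both contours")

    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Illustrative values in Hz (not data from the paper); 0.0 marks unvoiced frames.
reference = [120.0, 0.0, 130.0, 125.0]
synthesized = [118.0, 110.0, 0.0, 125.0]
error = log_f0_rmse(reference, synthesized)
```

A lower value means the synthesized pitch contour tracks the target speaker's contour more closely. Evaluations that report the error in cents or on a different log scale would rescale this quantity accordingly.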
Bibliographic reference. Ijima, Yusuke / Miyazaki, Noboru / Mizuno, Hideyuki (2013): "Statistical model training technique for speech synthesis based on speaker class", In SSW8, 141-145.