ISCA Archive SSW 2013

Statistical model training technique for speech synthesis based on speaker class

Yusuke Ijima, Noboru Miyazaki, Hideyuki Mizuno

To make speech generated by average-voice-based speech synthesis more similar to that of the target speaker, we propose a model training technique that introduces a speaker class label. The speaker class represents the voice characteristics of a speaker. In the proposed technique, all training data are first clustered to determine the speaker classes. The average voice model is then trained using both the conventional context labels and the speaker class labels. In the speaker adaptation process, the target speaker's class is estimated and used when transforming the average voice model into the target speaker's model. Speech of the target speaker is then synthesized from the target speaker's model and the estimated speaker class. The results of an objective experiment show that the proposed technique significantly reduces the RMS error of log F0. Moreover, the results of a subjective experiment indicate that the proposed technique yields synthesized speech with better speaker similarity than the conventional method.
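The steps in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not state which features or clustering algorithm are used, so the sketch assumes, purely for illustration, that each speaker is summarized by a mean log F0 value and that classes come from a tiny 1-D k-means. The function names and the `/SPK:` label suffix are hypothetical. The last function computes the RMS error of log F0 over frames that are voiced in both contours, which is one common way to define that objective measure.

```python
import math


def nearest(value, centroids):
    """Index of the centroid closest to value."""
    return min(range(len(centroids)), key=lambda i: abs(value - centroids[i]))


def kmeans_1d(values, k, iters=20):
    """Tiny 1-D k-means (assumed clustering method, not taken from the paper)."""
    ordered = sorted(values)
    # Spread the initial centroids across the sorted data.
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in values:
            groups[nearest(v, centroids)].append(v)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [sum(g) / len(g) if g else c for g, c in zip(groups, centroids)]
    return centroids


def add_speaker_class(context_label, speaker_class):
    """Append a speaker class tag to a context label (hypothetical label format)."""
    return f"{context_label}/SPK:{speaker_class}"


def rms_log_f0_error(ref_f0, syn_f0):
    """RMS error of log F0 over frames voiced (F0 > 0) in both contours."""
    diffs = [math.log(r) - math.log(s)
             for r, s in zip(ref_f0, syn_f0) if r > 0 and s > 0]
    if not diffs:
        raise ValueError("no frames are voiced in both contours")
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))


# Made-up training speakers, each summarized by mean log F0 (illustrative values).
train_speakers = {"spk_a": math.log(110), "spk_b": math.log(120),
                  "spk_c": math.log(220), "spk_d": math.log(240)}
centroids = kmeans_1d(list(train_speakers.values()), k=2)

# Training: every context label also carries its speaker's class.
train_classes = {name: nearest(v, centroids) for name, v in train_speakers.items()}

# Adaptation: estimate the target speaker's class, then label the synthesis input.
target_class = nearest(math.log(230), centroids)
print(add_speaker_class("a-b+c", target_class))
```

In a real system the class tag would become one more question available to the decision-tree context clustering of the average voice model; here it is only attached to a label string to show where the information enters.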

Index Terms: HMM-based speech synthesis, average voice model, speaker adaptation, speaker clustering


Cite as: Ijima, Y., Miyazaki, N., Mizuno, H. (2013) Statistical model training technique for speech synthesis based on speaker class. Proc. 8th ISCA Workshop on Speech Synthesis (SSW 8), 141-145

@inproceedings{ijima13_ssw,
  author={Yusuke Ijima and Noboru Miyazaki and Hideyuki Mizuno},
  title={{Statistical model training technique for speech synthesis based on speaker class}},
  year=2013,
  booktitle={Proc. 8th ISCA Workshop on Speech Synthesis (SSW 8)},
  pages={141--145}
}