Selection of Training Data for HMM-based Speech Synthesis from Prosodic Features - Use of Generation Process Model of Fundamental Frequency Contours

Tomoyuki Mizukami, Hiroya Hashimoto, Keikichi Hirose, Daisuke Saito, Nobuaki Minematsu


The generation process model of fundamental frequency (F0) contours is ideal for representing global F0 movements while keeping a clear relation to the underlying linguistic information of utterances. Using the model is expected to improve HMM-based speech synthesis. A new method is developed to cope with erroneous F0 values in utterances included in the HMM training corpus. F0 extraction errors not only cause wrong F0 values but also degrade the segmental features of synthetic speech, since they affect the overall accuracy of speech analysis. The method excludes from HMM training those speech segments where the extracted F0 values differ largely from those generated by the generation process model. Speech synthesis experiments showed a clear improvement in synthetic speech quality when phoneme-based exclusion is conducted with a properly selected threshold.
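
The phoneme-based exclusion described above can be illustrated with a minimal sketch. It is not the authors' implementation: the distance measure (RMS difference of log F0 over frames voiced in both contours), the threshold value, the choice to keep phonemes with no voiced frames, and all names (`Phoneme`, `log_f0_rmse`, `select_phonemes`) are assumptions made for illustration. The model-generated contour is taken as given input.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class Phoneme:
    label: str
    start: int  # first frame index (inclusive)
    end: int    # last frame index (exclusive)


def log_f0_rmse(extracted: Sequence[float], modeled: Sequence[float],
                start: int, end: int) -> Optional[float]:
    """RMS log-F0 difference over frames voiced (F0 > 0) in both contours.

    Returns None when the segment has no such frame.
    """
    squared = [
        (math.log(e) - math.log(m)) ** 2
        for e, m in zip(extracted[start:end], modeled[start:end])
        if e > 0 and m > 0
    ]
    if not squared:
        return None
    return math.sqrt(sum(squared) / len(squared))


def select_phonemes(phonemes: Sequence[Phoneme], extracted: Sequence[float],
                    modeled: Sequence[float], threshold: float) -> List[Phoneme]:
    """Keep phonemes whose extracted F0 stays close to the model contour.

    Assumption: phonemes without comparable voiced frames are kept, since
    no F0 mismatch can be measured for them.
    """
    kept = []
    for p in phonemes:
        distance = log_f0_rmse(extracted, modeled, p.start, p.end)
        if distance is None or distance <= threshold:
            kept.append(p)
    return kept


# Toy data in Hz (0 marks an unvoiced frame); the second phoneme
# contains an octave-doubling error typical of F0 extraction.
extracted = [120, 122, 124, 240, 244, 0, 118, 117, 116]
modeled = [121, 122, 123, 121, 120, 119, 118, 117, 115]
phonemes = [Phoneme("a", 0, 3), Phoneme("k", 3, 6), Phoneme("o", 6, 9)]

# Hypothetical threshold in log-F0 units; the paper selects its own value.
kept = select_phonemes(phonemes, extracted, modeled, threshold=0.2)
print([p.label for p in kept])  # → ['a', 'o']
```

In this toy case the phoneme with the doubled F0 is dropped from the training set while the others are retained. A lower threshold excludes more data, so the threshold trades off corpus size against F0 reliability.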


DOI: 10.21437/SpeechProsody.2014-197

Cite as: Mizukami, T., Hashimoto, H., Hirose, K., Saito, D., Minematsu, N. (2014) Selection of Training Data for HMM-based Speech Synthesis from Prosodic Features - Use of Generation Process Model of Fundamental Frequency Contours. Proc. 7th International Conference on Speech Prosody 2014, 1042-1046, DOI: 10.21437/SpeechProsody.2014-197.


@inproceedings{Mizukami2014,
  author={Tomoyuki Mizukami and Hiroya Hashimoto and Keikichi Hirose and Daisuke Saito and Nobuaki Minematsu},
  title={{Selection of Training Data for HMM-based Speech Synthesis from Prosodic Features - Use of Generation Process Model of Fundamental Frequency Contours}},
  year=2014,
  booktitle={Proc. 7th International Conference on Speech Prosody 2014},
  pages={1042--1046},
  doi={10.21437/SpeechProsody.2014-197},
  url={http://dx.doi.org/10.21437/SpeechProsody.2014-197}
}