A corpus-based method was developed for generating fundamental frequency (F0) contours of various speaking styles from text. Instead of directly predicting F0 values, the method predicts command values of the F0 contour generation process model. Because of the model constraint, the resulting F0 contour retains a certain degree of naturalness even when the prediction is incorrect. The method includes a scheme for automatic extraction of the model commands, which is necessary to prepare the training corpora for various speaking styles. Introducing constraints on phrase command locations improved the extraction, which in turn led to better performance of the method. Speech synthesis was conducted using an HMM-based speech synthesizer for calm speech and three types of emotional speech. A perceptual experiment showed that the designated emotions could be well conveyed with the F0 contours generated by the developed method.
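To illustrate why predicting model commands constrains the output, the following is a minimal sketch of the F0 contour generation process model in its commonly published form (often called the Fujisaki model), in which log F0 is a baseline plus phrase and accent components. The function names, the constants (alpha = 3.0, beta = 20.0, gamma = 0.9) and the command values in the usage note are illustrative assumptions, not values taken from the paper.

```python
import math

# Assumed typical constants; the paper's actual settings are not given here.
ALPHA = 3.0   # natural angular frequency of the phrase control mechanism (1/s)
BETA = 20.0   # natural angular frequency of the accent control mechanism (1/s)
GAMMA = 0.9   # ceiling level of the accent component


def phrase_response(t, alpha=ALPHA):
    """Impulse response of the phrase control mechanism."""
    if t < 0:
        return 0.0
    return alpha * alpha * t * math.exp(-alpha * t)


def accent_response(t, beta=BETA, gamma=GAMMA):
    """Step response of the accent control mechanism, clipped at gamma."""
    if t < 0:
        return 0.0
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), gamma)


def f0_contour(times, fb, phrase_cmds, accent_cmds):
    """Generate F0 values (Hz) at the given times (s).

    fb          -- baseline frequency in Hz
    phrase_cmds -- list of (onset_time, magnitude) impulses
    accent_cmds -- list of (onset_time, offset_time, amplitude) steps
    """
    contour = []
    for t in times:
        log_f0 = math.log(fb)
        for t0, ap in phrase_cmds:
            log_f0 += ap * phrase_response(t - t0)
        for t1, t2, aa in accent_cmds:
            log_f0 += aa * (accent_response(t - t1) - accent_response(t - t2))
        contour.append(math.exp(log_f0))
    return contour
```

For example, `f0_contour([i * 0.01 for i in range(200)], 120.0, [(0.0, 0.5)], [(0.3, 0.6, 0.4)])` yields a smooth contour that rises above the 120 Hz baseline and decays back toward it. Because every contour is a sum of these smooth responses, a mispredicted command shifts or rescales a component rather than producing frame-level jumps, which is consistent with the naturalness claim above.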
Cite as: Hirose, K., Sato, K., Minematsu, N. (2004) Corpus-based synthesis of fundamental frequency contours with various speaking styles from text using F0 contour generation process model. Proc. 5th ISCA Workshop on Speech Synthesis (SSW 5), 161-166
@inproceedings{hirose04_ssw,
  author={Keikichi Hirose and Kentaro Sato and Nobuaki Minematsu},
  title={{Corpus-based synthesis of fundamental frequency contours with various speaking styles from text using F0 contour generation process model}},
  year=2004,
  booktitle={Proc. 5th ISCA Workshop on Speech Synthesis (SSW 5)},
  pages={161--166}
}