Speech Prosody 2004
A method was developed for the corpus-based synthesis of emotional speech. Fundamental frequency (F0) contours were synthesized by predicting the command values of the generation process model using binary regression trees, with linguistic information of the sentence to be synthesized as input. Because of the constraints imposed by the model, synthesized speech retains a certain quality even when the prediction is poor. Prediction of accent phrase boundaries for the input text, a necessary step in the synthesis, was also realized in a similar statistical framework. The HMM synthesis scheme was used to generate segmental features. The speech corpus used for the synthesis includes three types of emotional speech (anger, joy, sadness) and calm speech uttered by a female narrator. The command values of the model necessary for training and testing the method were automatically extracted using a program developed by the authors. For better prediction, accent phrases for which the automatic extraction performed poorly were excluded from the training corpus. The mismatches between the predicted and target contours for angry speech were similar to those for calm speech. Larger mismatches were observed for sad and joyful speech. A perceptual experiment was conducted using synthesized speech, and the results indicated that anger could be conveyed well by the developed method.
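As an illustration of how command values determine an F0 contour, the following is a minimal sketch. It assumes that the generation process model refers to the Fujisaki command-response formulation, in which log F0 is a baseline plus phrase and accent components. The function names, the constants (alpha = 3.0, beta = 20.0, gamma = 0.9) and the command values in the example are illustrative assumptions, not values taken from the paper.

```python
import math

# Assumed typical constants; the paper's actual settings are not given in the abstract.
ALPHA = 3.0   # natural angular frequency of the phrase control mechanism (1/s)
BETA = 20.0   # natural angular frequency of the accent control mechanism (1/s)
GAMMA = 0.9   # ceiling of the accent component


def phrase_response(t, alpha=ALPHA):
    """Impulse response of the phrase control mechanism; zero before onset."""
    if t < 0:
        return 0.0
    return alpha * alpha * t * math.exp(-alpha * t)


def accent_response(t, beta=BETA, gamma=GAMMA):
    """Step response of the accent control mechanism, capped at gamma."""
    if t < 0:
        return 0.0
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), gamma)


def f0_contour(times, fb, phrase_cmds, accent_cmds):
    """Return F0 in Hz at each time in `times`.

    fb          : baseline frequency in Hz
    phrase_cmds : list of (onset_time, magnitude) impulses
    accent_cmds : list of (onset_time, offset_time, amplitude) steps
    """
    contour = []
    for t in times:
        log_f0 = math.log(fb)
        for t0, ap in phrase_cmds:
            log_f0 += ap * phrase_response(t - t0)
        for t1, t2, aa in accent_cmds:
            log_f0 += aa * (accent_response(t - t1) - accent_response(t - t2))
        contour.append(math.exp(log_f0))
    return contour


if __name__ == "__main__":
    # Hypothetical command values for one accent phrase.
    times = [i * 0.01 for i in range(100)]
    f0 = f0_contour(times, fb=120.0,
                    phrase_cmds=[(0.0, 0.5)],
                    accent_cmds=[(0.2, 0.5, 0.4)])
    print(min(f0), max(f0))
```

Under this formulation, a regression tree only has to predict a handful of command timings and magnitudes per accent phrase. The smooth response functions then shape the contour, which is consistent with the abstract's remark that the model constraint preserves a certain quality even when prediction is poor.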
Bibliographic reference. Hirose, Keikichi / Sato, Kentaro / Minematsu, Nobuaki (2004): "Emotional speech synthesis with corpus-based generation of F0 contours using generation process model", In SP-2004, 421-424.