INTERSPEECH 2004 - ICSLP
We improved the corpus preparation process in our fully automatic corpus-based method of generating fundamental frequency (F0) contours for emotional speech synthesis. The method assumes the generation process model and predicts its command parameters using binary regression trees whose inputs are linguistic information on the sentence to be synthesized. Because of the constraint imposed by the model, the synthesized speech retains a certain quality even when the prediction is incorrect. The speech corpus includes three types of emotional speech (anger, joy, sadness) and calm speech, all uttered by a female narrator. The command parameters necessary for training and testing the method were automatically extracted from speech using a program developed by the authors. Since the accuracy of the extraction largely affects the prediction performance, a new constraint was applied to the position of phrase commands during extraction. Also, since the performance of phrase command prediction dominates the overall accuracy of the generated F0 contours, the method was modified to predict phrase commands first. The mismatches between the predicted and target contours for angry speech were similar to those for calm speech. Synthesis of emotional speech was conducted from text input. The segmental features were handled by the HMM synthesis method, and the phoneme durations were predicted by a similar corpus-based method. A perceptual experiment was conducted using the synthesized speech, and the results indicated that anger was conveyed well by the developed method. The results were worse for joy and sadness.
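As an illustration of how phrase and accent commands determine an F0 contour, the following is a minimal sketch of the generation process model in its commonly published form (often called the Fujisaki model). The abstract does not give the equations or constants: the values of ALPHA, BETA and GAMMA below are typical values from the literature and are assumptions, as are all function names. The sketch covers only contour generation from given commands, not the regression-tree prediction or the automatic command extraction described above.

```python
import math

# Assumed typical constants; not taken from the abstract.
ALPHA = 3.0   # natural angular frequency of the phrase control mechanism (1/s)
BETA = 20.0   # natural angular frequency of the accent control mechanism (1/s)
GAMMA = 0.9   # ceiling of the accent component


def phrase_response(t, alpha=ALPHA):
    """Impulse response of the phrase control mechanism (zero before onset)."""
    if t < 0:
        return 0.0
    return alpha * alpha * t * math.exp(-alpha * t)


def accent_response(t, beta=BETA, gamma=GAMMA):
    """Step response of the accent control mechanism, clipped at gamma."""
    if t < 0:
        return 0.0
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), gamma)


def f0_contour(times, base_f0, phrase_cmds, accent_cmds):
    """Return F0 in Hz at each time in `times` (seconds).

    phrase_cmds: list of (onset_time, magnitude) impulses
    accent_cmds: list of (onset_time, offset_time, amplitude) steps
    Components are summed in the log-F0 domain on top of the baseline.
    """
    contour = []
    for t in times:
        log_f0 = math.log(base_f0)
        for t0, ap in phrase_cmds:
            log_f0 += ap * phrase_response(t - t0)
        for t1, t2, aa in accent_cmds:
            log_f0 += aa * (accent_response(t - t1) - accent_response(t - t2))
        contour.append(math.exp(log_f0))
    return contour


if __name__ == "__main__":
    # Hypothetical commands chosen only for illustration.
    times = [i * 0.1 for i in range(21)]
    f0 = f0_contour(times, 200.0, [(0.0, 0.5)], [(0.3, 0.8, 0.4)])
    for t, value in zip(times, f0):
        print(f"{t:.1f} s  {value:.1f} Hz")
```

Because every contour produced this way is a sum of smooth responses on a constant baseline, a mispredicted command shifts or rescales the contour rather than producing an implausible shape, which is consistent with the remark above that the model constraint preserves a certain quality under prediction errors.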
Bibliographic reference. Hirose, Keikichi (2004): "Improvement in corpus-based generation of F0 contours using generation process model for emotional speech synthesis", In INTERSPEECH-2004, 1349-1352.