ISCA Archive Interspeech 2009

Weighted neural network ensemble models for speech prosody control

Harald Romsdorfer

In text-to-speech synthesis systems, the quality of the predicted prosody contours influences the quality and naturalness of the synthetic speech. This paper presents a new statistical model for prosody control that combines an ensemble learning technique, using neural networks as base learners, with feature relevance determination. This weighted neural network ensemble model was applied to both phone duration modeling and fundamental frequency (F0) modeling. A comparison with state-of-the-art prosody models based on classification and regression trees (CART), multivariate adaptive regression splines (MARS), or artificial neural networks (ANN) shows a 12% improvement over the best duration model and a 24% improvement over the best F0 model. The neural network ensemble model also outperforms another recently presented ensemble model based on gradient tree boosting.


doi: 10.21437/Interspeech.2009-183

Cite as: Romsdorfer, H. (2009) Weighted neural network ensemble models for speech prosody control. Proc. Interspeech 2009, 492-495, doi: 10.21437/Interspeech.2009-183

@inproceedings{romsdorfer09b_interspeech,
  author={Harald Romsdorfer},
  title={{Weighted neural network ensemble models for speech prosody control}},
  year=2009,
  booktitle={Proc. Interspeech 2009},
  pages={492--495},
  doi={10.21437/Interspeech.2009-183}
}