Predicting accurate segmental durations remains a difficult problem when synthesising speech from text. Inaccurate durations are often perceptually prominent and detract from the naturalness of the speech. For a concatenative system, a statistical approach is an excellent way of predicting segmental durations. More specifically, a CART (Classification And Regression Tree) method is appropriate [1], but only if it has been correctly trained with data that reflects a phoneme's characteristics. A feature-set is used to describe the flavour of a phoneme when building CART trees. We describe a novel method in which BT's Laureate Text-to-Speech (TTS) system is used to automatically supply the prosodic information required to make up the feature-set, which is ultimately used as training data for building a CART tree. This tree, in turn, is used to predict segmental durations. Extracting salience (derived from a metrical analysis of the text) and the other prosodic and segmental features in this way is a novel concept. The resulting trees consistently show that the salience feature, in particular, has a large effect on the duration of a phoneme. The paper describes this concept in detail and shows the importance of salience. An evaluation of the effectiveness of CART-based duration modelling against the rule-based Laureate TTS method is given in the results.
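As an illustration of the general idea only, the sketch below grows a small regression tree that predicts a phoneme duration from a feature-set. It is not the authors' implementation and does not use Laureate: the feature names (`salience`, `is_vowel`, `phrase_final`), the durations in milliseconds, and the `min_leaf` parameter are all invented for the example, and the splitting rule is plain least-squares regression as commonly used in CART.

```python
from statistics import mean

def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

def best_split(rows, targets, min_leaf):
    """Return the (feature, threshold) that most reduces squared error, or None."""
    best, best_cost = None, sse(targets)
    for feature in rows[0]:
        # Candidate thresholds: every observed value except the largest.
        for threshold in sorted({r[feature] for r in rows})[:-1]:
            left = [t for r, t in zip(rows, targets) if r[feature] <= threshold]
            right = [t for r, t in zip(rows, targets) if r[feature] > threshold]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue
            cost = sse(left) + sse(right)
            if cost < best_cost - 1e-12:
                best, best_cost = (feature, threshold), cost
    return best

def build_tree(rows, targets, min_leaf=2):
    """Recursively grow a regression tree; leaves store the mean duration."""
    split = best_split(rows, targets, min_leaf)
    if split is None:
        return {"leaf": mean(targets)}
    feature, threshold = split
    left = [(r, t) for r, t in zip(rows, targets) if r[feature] <= threshold]
    right = [(r, t) for r, t in zip(rows, targets) if r[feature] > threshold]
    return {
        "feature": feature,
        "threshold": threshold,
        "left": build_tree([r for r, _ in left], [t for _, t in left], min_leaf),
        "right": build_tree([r for r, _ in right], [t for _, t in right], min_leaf),
    }

def predict(tree, row):
    """Walk the tree from the root to a leaf and return its mean duration."""
    while "leaf" not in tree:
        branch = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree["leaf"]

# Hypothetical training data: one feature dict per phoneme, duration in ms.
rows = [
    {"salience": 0, "is_vowel": 0, "phrase_final": 0},
    {"salience": 0, "is_vowel": 1, "phrase_final": 0},
    {"salience": 1, "is_vowel": 0, "phrase_final": 0},
    {"salience": 1, "is_vowel": 1, "phrase_final": 0},
    {"salience": 2, "is_vowel": 0, "phrase_final": 0},
    {"salience": 2, "is_vowel": 1, "phrase_final": 0},
    {"salience": 2, "is_vowel": 0, "phrase_final": 1},
    {"salience": 2, "is_vowel": 1, "phrase_final": 1},
]
durations_ms = [50, 60, 70, 80, 110, 120, 130, 140]

tree = build_tree(rows, durations_ms)
print(tree["feature"], predict(tree, rows[-1]))
```

In this toy data the durations were chosen to rise with salience, so the first split falls on that feature. This only shows how a tree can surface a dominant feature; it says nothing about the paper's actual results.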
Cite as: Deans, P., Breen, A., Jackson, P. (1999) CART-based duration modeling using a novel method of extracting prosodic features. Proc. 6th European Conference on Speech Communication and Technology (Eurospeech 1999), 1823-1826, doi: 10.21437/Eurospeech.1999-397
@inproceedings{deans99_eurospeech,
  author={Paul Deans and Andrew Breen and Peter Jackson},
  title={{CART-based duration modeling using a novel method of extracting prosodic features}},
  year=1999,
  booktitle={Proc. 6th European Conference on Speech Communication and Technology (Eurospeech 1999)},
  pages={1823--1826},
  doi={10.21437/Eurospeech.1999-397}
}