Sixth European Conference on Speech Communication and Technology
The prediction of accurate segmental durations remains a difficult problem when synthesising speech from text. Inaccurate durations are often perceptually prominent and detract from the naturalness of the synthesised speech. For a concatenative system, a statistical approach is well suited to predicting segmental durations. More specifically, a CART (Classification And Regression Tree) method is appropriate, but only if it has been trained on data that reflects a phoneme’s characteristics. A feature set is used to characterise each phoneme when the CART trees are built. We describe a novel method in which BT’s Laureate text-to-speech (TTS) system automatically supplies the prosodic information that makes up the feature set, which then serves as training data for building a CART tree. This tree, in turn, is used to predict segmental durations. Extracting salience (derived from a metrical analysis of the text) and the other prosodic and segmental features in this way is a novel concept. The CART trees consistently show that the salience feature, in particular, has a large effect on the duration of a phoneme. The paper describes this concept in detail and demonstrates the importance of salience. The results include an evaluation of CART-based duration modelling against the rule-based Laureate TTS method.
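The general idea of a regression tree trained on a per-phoneme feature set can be sketched as follows. This is a minimal illustration only, not the authors' implementation: the three features (`salience`, `is_vowel`, `phrase_final`), their encoding, and every duration value are invented toy data, and the tree-growing routine is a plain variance-reduction splitter written for this example rather than the CART tooling used in the paper.

```python
from statistics import mean

# Hypothetical feature rows: (salience level, is_vowel, phrase_final).
# All values below are invented toy data, not measurements from the paper.
ROWS = [
    (0, 0, 0), (0, 1, 0), (1, 0, 0), (1, 1, 0),
    (2, 0, 0), (2, 1, 0), (2, 1, 1), (0, 1, 1),
]
DURATIONS_MS = [50, 70, 60, 95, 70, 120, 150, 90]


def sse(values):
    """Sum of squared errors around the mean (0 for an empty list)."""
    if not values:
        return 0.0
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def best_split(rows, targets):
    """Return the (feature, threshold) that most reduces SSE, or None."""
    best, best_score = None, sse(targets)
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, targets) if r[f] <= t]
            right = [y for r, y in zip(rows, targets) if r[f] > t]
            if not left or not right:
                continue
            score = sse(left) + sse(right)
            if score < best_score - 1e-9:  # require a strict improvement
                best, best_score = (f, t), score
    return best


def build_tree(rows, targets):
    """Grow a regression tree; a leaf is the mean duration of its rows."""
    split = best_split(rows, targets)
    if split is None:
        return mean(targets)
    f, t = split
    left = [(r, y) for r, y in zip(rows, targets) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, targets) if r[f] > t]
    return (
        f,
        t,
        build_tree([r for r, _ in left], [y for _, y in left]),
        build_tree([r for r, _ in right], [y for _, y in right]),
    )


def predict(tree, row):
    """Walk from the root to a leaf and return the predicted duration."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree


tree = build_tree(ROWS, DURATIONS_MS)
print(predict(tree, (2, 1, 0)))  # duration predicted for one toy feature row
```

In a real system the feature rows would come from the TTS front end and the durations from a labelled speech corpus; the tree would also normally be pruned or given a minimum leaf size so that it does not simply memorise the training data, as this unpruned sketch does.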
Bibliographic reference. Deans, Paul / Breen, Andrew / Jackson, Peter (1999): "CART-based duration modeling using a novel method of extracting prosodic features", In EUROSPEECH'99, 1823-1826.