We are developing a model that predicts phoneme duration as a function of segmental and suprasegmental factors, with the objective of using it for speech recognition. Our goal is to account for the many duration effects, ranging from local phonetic to sentence-level, and to determine how accurately we can model segment durations for sentences drawn from a large database spoken by many speakers. Our approach is to develop a hierarchical structure of categorical distinctions based on discrete-valued variables representing attributes of a phoneme and its context. We choose this technique over additive or multiplicative models because duration effects often interact in a complex manner. In our procedure, the two descendants of a parent node can be split using different variables, thus allowing us to model non-uniform interactions among factors. When tested on 630 sentences from 126 speakers not used for training, our models explain 60% of the vowel duration variance and 55% of the consonant duration variance within manner classes, yielding a root-mean-square prediction error of approximately 31 ms for vowels and 26 ms for consonants.
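The idea of a hierarchy in which sibling nodes may split on different variables can be illustrated with a minimal sketch. It is not the authors' procedure: the abstract does not state the splitting criterion, so the greedy variance-reduction rule below is an assumption, as are the factor names (`position`, `stress`, `next_voicing`), the helper names, and the eight synthetic durations, which are invented purely for illustration and are not taken from the paper's database.

```python
from math import sqrt
from statistics import mean

# Synthetic, illustrative rows: (categorical factors, duration in ms).
# These values are made up for the sketch; they are not data from the paper.
ROWS = [
    ({"position": "final", "stress": "stressed", "next_voicing": "voiced"}, 200.0),
    ({"position": "final", "stress": "stressed", "next_voicing": "voiceless"}, 210.0),
    ({"position": "final", "stress": "unstressed", "next_voicing": "voiced"}, 130.0),
    ({"position": "final", "stress": "unstressed", "next_voicing": "voiceless"}, 120.0),
    ({"position": "medial", "stress": "stressed", "next_voicing": "voiced"}, 100.0),
    ({"position": "medial", "stress": "unstressed", "next_voicing": "voiced"}, 110.0),
    ({"position": "medial", "stress": "stressed", "next_voicing": "voiceless"}, 70.0),
    ({"position": "medial", "stress": "unstressed", "next_voicing": "voiceless"}, 60.0),
]


def sse(durations):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not durations:
        return 0.0
    m = mean(durations)
    return sum((d - m) ** 2 for d in durations)


def best_split(rows, min_rows):
    """Return (gain, variable, value) for the best 'variable == value' test, or None."""
    parent = sse([d for _, d in rows])
    best = None
    for var in sorted(rows[0][0]):
        for value in sorted({f[var] for f, _ in rows}):
            yes = [d for f, d in rows if f[var] == value]
            no = [d for f, d in rows if f[var] != value]
            if len(yes) < min_rows or len(no) < min_rows:
                continue
            gain = parent - sse(yes) - sse(no)
            if best is None or gain > best[0]:
                best = (gain, var, value)
    return best


def grow(rows, min_rows=2):
    """Grow the tree greedily; each node chooses its own split variable."""
    node = {"mean": mean(d for _, d in rows), "n": len(rows)}
    split = best_split(rows, min_rows)
    if split is not None and split[0] > 0.0:
        _, var, value = split
        node["var"], node["value"] = var, value
        node["yes"] = grow([r for r in rows if r[0][var] == value], min_rows)
        node["no"] = grow([r for r in rows if r[0][var] != value], min_rows)
    return node


def predict(node, factors):
    """Walk from the root to a leaf and return that leaf's mean duration."""
    while "var" in node:
        node = node["yes"] if factors[node["var"]] == node["value"] else node["no"]
    return node["mean"]


def evaluate(node, rows):
    """Return (fraction of variance explained, RMS prediction error in ms)."""
    residual = sum((d - predict(node, f)) ** 2 for f, d in rows)
    total = sse([d for _, d in rows])
    return 1.0 - residual / total, sqrt(residual / len(rows))


tree = grow(ROWS)
explained, rmse = evaluate(tree, ROWS)
print(tree["var"], tree["yes"]["var"], tree["no"]["var"])
print(round(explained, 3), round(rmse, 1))
```

On this toy data the root splits on `position`, after which the phrase-final branch splits on `stress` and the phrase-medial branch splits on `next_voicing`. This is the kind of non-uniform interaction that a purely additive or multiplicative model cannot express. The sketch scores itself on its own training rows, whereas the figures reported above come from held-out speakers.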
Bibliographic reference. Pitrelli, John F. / Zue, Victor W. (1989): "A hierarchical model for phoneme duration in American English", In EUROSPEECH-1989, 2324-2327.