A well-known weakness of HMMs in speech recognition is their poor modeling of phone and word durations. This paper describes an approach that addresses this limitation by integrating explicit word duration models into an HMM-based speech recognizer. Word durations are represented by log-normal densities, with a back-off strategy that approximates the durations of rarely observed words by combining the statistics of suitable sub-word units. Furthermore, two normalization procedures are compared that reduce the influence of the implicit HMM duration distribution resulting from the state-to-state transition probabilities. Experiments on European parliamentary speeches in English and Spanish show that the proposed approaches are effective and lead to small but consistent reductions in word error rate for large-vocabulary speech recognition tasks.
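The log-normal duration model with back-off can be illustrated by a minimal Python sketch. It is not the authors' implementation: the abstract does not say how sub-word statistics are combined, so the sketch assumes that sub-word units are phones, that a word's duration is the sum of independent phone durations, and that this sum is approximated by a log-normal through moment matching (the Fenton-Wilkinson approximation). The function names, the `min_count` threshold, and the variance floor are hypothetical choices made for illustration.

```python
import math

def fit_lognormal(durations, var_floor=1e-4):
    """Maximum-likelihood (mu, sigma2) of the log-durations.

    var_floor is a hypothetical safeguard against zero variance.
    """
    logs = [math.log(d) for d in durations]
    mu = sum(logs) / len(logs)
    var = sum((x - mu) ** 2 for x in logs) / len(logs)
    return mu, max(var, var_floor)

def lognormal_logpdf(d, mu, sigma2):
    """Log-density of a log-normal distribution at duration d > 0."""
    log_d = math.log(d)
    return (-log_d
            - 0.5 * math.log(2.0 * math.pi * sigma2)
            - (log_d - mu) ** 2 / (2.0 * sigma2))

def backoff_from_subwords(unit_params):
    """Approximate a word's log-normal from its sub-word units.

    Assumption (not from the paper): unit durations are independent and
    add up to the word duration; the sum is moment-matched to a log-normal.
    unit_params is a list of (mu, sigma2) pairs, one per unit.
    """
    total_mean = 0.0
    total_var = 0.0
    for mu, sigma2 in unit_params:
        mean = math.exp(mu + 0.5 * sigma2)                 # linear-domain mean
        var = (math.exp(sigma2) - 1.0) * math.exp(2.0 * mu + sigma2)
        total_mean += mean
        total_var += var
    word_sigma2 = math.log(1.0 + total_var / total_mean ** 2)
    word_mu = math.log(total_mean) - 0.5 * word_sigma2
    return word_mu, word_sigma2

def word_duration_logprob(word, phones, duration,
                          word_samples, phone_params, min_count=3):
    """Duration log-score for one word hypothesis in a word graph.

    word_samples: dict word -> list of observed durations (hypothetical input).
    phone_params: dict phone -> (mu, sigma2) estimated from training data.
    min_count: hypothetical threshold below which the back-off is used.
    """
    samples = word_samples.get(word, [])
    if len(samples) >= min_count:
        mu, sigma2 = fit_lognormal(samples)
    else:
        mu, sigma2 = backoff_from_subwords([phone_params[p] for p in phones])
    return lognormal_logpdf(duration, mu, sigma2)
```

In a rescoring pass, a score such as `word_duration_logprob(...)` would typically be added, with a tunable weight, to the acoustic and language model scores of each word-graph edge. The normalization against the implicit HMM duration distribution mentioned above is not covered by this sketch.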
Bibliographic reference. Seppi, Dino / Falavigna, Daniele / Stemmer, Georg / Gretter, Roberto (2007): "Word duration modeling for word graph rescoring in LVCSR", In INTERSPEECH-2007, 1805-1808.