14th Annual Conference of the International Speech Communication Association

Lyon, France
August 25-29, 2013

Phone Duration Modeling Using Clustering of Rich Contexts

Tanel Alumäe, Rena Nemoto

Tallinn University of Technology, Estonia

This paper describes a phone duration model applied to speech recognition. The model is based on a decision tree that finds clusters of phones in various contexts that tend to have similar durations. Wide contexts with rich linguistic and phonetic features are used. To better model varying and non-stationary speaking rates, the contextual features also include the observed duration values of previous phones. For each resulting phone cluster, a log-normal distribution of duration is estimated. The resulting decision tree and the log-normal distributions are used to calculate likelihoods of phone durations in N-best lists. Experiments on two Estonian recognition tasks show a small but significant improvement in speech recognition accuracy.
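As a rough illustration of the last two steps described above (per-cluster log-normal estimation and scoring of N-best hypotheses), the following is a minimal sketch and not the authors' implementation. It assumes the decision tree has already assigned each phone to a cluster ID. The function names, the variance floor, and the interpolation weight `weight` are hypothetical and do not come from the paper.

```python
import math

def fit_lognormal(durations):
    """Maximum-likelihood (mu, sigma) of the log-durations in one cluster."""
    logs = [math.log(d) for d in durations]
    mu = sum(logs) / len(logs)
    var = sum((x - mu) ** 2 for x in logs) / len(logs)
    # Assumed variance floor so that tiny clusters do not yield sigma = 0.
    return mu, math.sqrt(max(var, 1e-6))

def lognormal_logpdf(d, mu, sigma):
    """Log density of a log-normal distribution at duration d > 0."""
    z = (math.log(d) - mu) / sigma
    return -math.log(d * sigma * math.sqrt(2.0 * math.pi)) - 0.5 * z * z

def duration_loglik(phones, models):
    """Sum of log-likelihoods over (cluster_id, duration) pairs."""
    return sum(lognormal_logpdf(d, *models[c]) for c, d in phones)

def rescore(nbest, models, weight=1.0):
    """Return the index of the best hypothesis.

    nbest: list of (base_score, phones), where base_score is the
    recognizer's log score and phones is a list of (cluster_id, duration).
    weight: hypothetical scaling factor for the duration score.
    """
    scores = [base + weight * duration_loglik(phones, models)
              for base, phones in nbest]
    return max(range(len(nbest)), key=scores.__getitem__)
```

In this sketch, a hypothesis whose phone durations are typical for their clusters receives a higher combined score than one with atypical durations, even when the recognizer's own scores are equal.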

Bibliographic reference. Alumäe, Tanel / Nemoto, Rena (2013): "Phone duration modeling using clustering of rich contexts", in INTERSPEECH-2013, 1801-1805.