Interspeech'2005 - Eurospeech

Lisbon, Portugal
September 4-8, 2005

Inducing Decision Tree Pronunciation Variation Models from Annotated Speech Data

Per-Anders Jande

KTH, Stockholm, Sweden

A model of the pronunciation of words in discourse context has been induced from the annotation of a spoken language corpus. The annotation comprises a set of variables hypothesised to be important for the pronunciation of words in discourse context. It is connected to segmentally defined units on tiers corresponding to linguistically relevant units: the discourse, the utterance, the phrase, the word, the syllable and the phoneme. The model is represented as a tree structure, making it transparent for analysis and easy to use in a speech synthesis system. Using phonemic canonical pronunciation representations to estimate the segmental string of the annotated data gives a 22.1% phone error rate. Decision tree pronunciation variation models generated in a tenfold cross-validation procedure showed an average phone error rate of 9.9%. Using multiple context variables for modelling pronunciation variation thus reduced the phone error rate by 55% in relative terms, compared to a baseline using canonical pronunciation representations.
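The two error rates and the 55% figure can be related with a short sketch. The abstract does not say how the phone error rate was computed, so the code below assumes a common definition: the Levenshtein (edit) distance between reference and predicted phone strings, divided by the number of reference phones. The function names and the toy phone strings are hypothetical and serve only as illustration; only the values 22.1 and 9.9 come from the abstract.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two phone sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def phone_error_rate(refs, hyps):
    """Total edit distance divided by total reference length (assumed definition)."""
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total = sum(len(r) for r in refs)
    return errors / total


def relative_reduction(baseline, model):
    """Relative error reduction of a model with respect to a baseline."""
    return (baseline - model) / baseline


# Toy example (hypothetical data): one reference phone string, one prediction
# in which a single phone has been deleted.
toy_per = phone_error_rate([["a", "b", "c", "d"]], [["a", "b", "d"]])

# Figures reported in the abstract: canonical baseline vs. decision tree models.
reduction = relative_reduction(22.1, 9.9)
print(f"toy PER: {toy_per:.2f}, relative reduction: {reduction:.1%}")
```

With the reported figures, the relative reduction comes out at roughly 55%, which matches the number stated in the abstract.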


Bibliographic reference. Jande, Per-Anders (2005): "Inducing decision tree pronunciation variation models from annotated speech data", in INTERSPEECH-2005, 1945-1948.