Interspeech'2005 - Eurospeech
A model of the pronunciation of words in discourse context has been induced from the annotation of a spoken language corpus. The annotation encodes a set of variables hypothesised to be important for the pronunciation of words in discourse context. It is connected to segmentally defined units on tiers corresponding to linguistically relevant units: the discourse, the utterance, the phrase, the word, the syllable and the phoneme. The model is represented as a tree structure, making it transparent for analysis and easy to use in a speech synthesis system. Using phonemic canonical pronunciation representations to estimate the segmental string of the annotated data gives a 22.1% phone error rate. Decision tree pronunciation variation models generated in a ten-fold cross-validation procedure showed an average phone error rate of 9.9%. Using multiple context variables for modelling pronunciation variation could thus reduce the error rate by 55%, compared to a baseline using canonical pronunciation representations.
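Phone error rate is conventionally computed as the string edit (Levenshtein) distance between a predicted phone string and the reference transcription, normalised by the reference length; the abstract does not spell out the metric, so the sketch below is an assumption based on that standard definition, with illustrative phone symbols:

```python
def phone_error_rate(reference, hypothesis):
    """Levenshtein distance over phone symbols, normalised by reference length."""
    m, n = len(reference), len(hypothesis)
    # d[i][j] = minimum edits to turn reference[:i] into hypothesis[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # i deletions
    for j in range(n + 1):
        d[0][j] = j  # j insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n] / m

# Hypothetical example: a canonical form with a final phone deleted in speech.
canonical = ["m", "e", "d", "a", "n"]
realised = ["m", "e", "d", "a"]
print(phone_error_rate(canonical, realised))  # → 0.2 (1 deletion / 5 phones)

# Relative error reduction reported in the abstract: (22.1 - 9.9) / 22.1 ≈ 55%.
print(round((22.1 - 9.9) / 22.1, 2))  # → 0.55
```

Averaging this per-string rate over held-out folds of the corpus would give the cross-validation figures quoted above.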
Bibliographic reference. Jande, Per-Anders (2005): "Inducing decision tree pronunciation variation models from annotated speech data", In INTERSPEECH-2005, 1945-1948.