ISCA Archive Interspeech 2005

Inducing decision tree pronunciation variation models from annotated speech data

Per-Anders Jande

A model of pronunciation of words in discourse context has been induced from the annotation of a spoken language corpus. The annotation includes a set of variables hypothesised to be important for the pronunciation of words in discourse context. It is connected to segmentally defined units on tiers corresponding to linguistically relevant units: the discourse, the utterance, the phrase, the word, the syllable and the phoneme. The model is represented as a tree structure, which makes it transparent for analysis and easy to use in a speech synthesis system. Using phonemic canonical pronunciation representations to estimate the segmental string of the annotated data gives a 22.1% phone error rate. Decision tree pronunciation variation models generated in a tenfold cross-validation procedure gave an average phone error rate of 9.9%. Using multiple context variables for modelling pronunciation variation could thus reduce the phone error rate by 55% relative to a baseline using canonical pronunciation representations (from 22.1% to 9.9%, i.e. 12.2 percentage points).
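The two figures compared above can be illustrated with a short sketch. It assumes that the phone error rate is the Levenshtein (substitution, deletion, insertion) distance between the annotated phone string (reference) and the predicted one (canonical or model output), pooled over all items and divided by the total number of reference phones; the paper may define or align the strings differently. The function names `edit_distance`, `phone_error_rate` and `relative_reduction` are hypothetical and the phone symbols in the usage note are invented for illustration.

```python
from typing import Iterable, Sequence, Tuple


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Minimum number of substitutions, deletions and insertions turning ref into hyp."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            curr[j] = min(
                prev[j] + 1,             # deletion of reference phone r
                curr[j - 1] + 1,         # insertion of predicted phone h
                prev[j - 1] + (r != h),  # substitution (cost 0 on a match)
            )
        prev = curr
    return prev[-1]


def phone_error_rate(pairs: Iterable[Tuple[Sequence[str], Sequence[str]]]) -> float:
    """Hypothetical PER: total edit operations / total reference phones, pooled over items."""
    pairs = list(pairs)  # materialise so the pairs can be traversed twice
    errors = sum(edit_distance(ref, hyp) for ref, hyp in pairs)
    n_ref = sum(len(ref) for ref, _ in pairs)
    return errors / n_ref


def relative_reduction(baseline: float, model: float) -> float:
    """Error reduction relative to the baseline error rate."""
    return (baseline - model) / baseline


# Baseline (canonical forms) vs. decision tree models, using the reported rates.
print(f"{relative_reduction(22.1, 9.9):.1%}")  # prints 55.2%
```

For example, `phone_error_rate([(["a", "b", "c"], ["a", "c"]), (["d", "e"], ["d", "e"])])` counts one deletion over five reference phones and returns 0.2. Pooling errors over all items, rather than averaging per-word rates, keeps long and short words weighted by their number of phones.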


doi: 10.21437/Interspeech.2005-608

Cite as: Jande, P.-A. (2005) Inducing decision tree pronunciation variation models from annotated speech data. Proc. Interspeech 2005, 1945-1948, doi: 10.21437/Interspeech.2005-608

@inproceedings{jande05_interspeech,
  author={Per-Anders Jande},
  title={{Inducing decision tree pronunciation variation models from annotated speech data}},
  year=2005,
  booktitle={Proc. Interspeech 2005},
  pages={1945--1948},
  doi={10.21437/Interspeech.2005-608}
}