ISCA Archive Interspeech 2013

Text-to-speech inspired duration modeling for improved whole-word acoustic models

Keith Kintzley, Aren Jansen, Hynek Hermansky

In the construction of whole-word acoustic models, we have previously demonstrated substantial gains by using MAP estimation to introduce a simple prior model of phonetic timing. Based solely on the word's phonetic (dictionary) pronunciation, this simple model included no information about the individual durations of constituent phones. However, the problem of modeling segmental duration has long been studied in the text-to-speech (TTS) community. We draw upon this work to develop a classification and regression tree (CART) approach for constructing prior models of phonetic timing which considers factors such as syllable stress, syllable position, adjacent phone class and voicing. This improved prior model closes 33% of the gap in keyword spotting performance between highly supervised whole-word models and those estimated without any examples.
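As a rough illustration of the CART idea mentioned in the abstract, the sketch below grows a small regression tree that predicts a phone's duration from categorical context factors. It is not the authors' implementation: the feature names (`stressed`, `position`, `next_voiced`), the helper functions, and the duration values in milliseconds are all hypothetical and invented for the example. It assumes binary equality splits chosen to minimize the summed squared error.

```python
from statistics import mean

def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

def best_split(rows, targets):
    """Return the (feature, value) equality test that most reduces SSE, or None."""
    best, best_cost = None, sse(targets)
    for feature in rows[0]:
        # sorted() keeps tie-breaking deterministic across runs
        for value in sorted({r[feature] for r in rows}, key=str):
            left = [t for r, t in zip(rows, targets) if r[feature] == value]
            right = [t for r, t in zip(rows, targets) if r[feature] != value]
            if not left or not right:
                continue  # a split must put data on both sides
            cost = sse(left) + sse(right)
            if cost < best_cost - 1e-12:
                best, best_cost = (feature, value), cost
    return best

def build_tree(rows, targets):
    """Recursively grow a regression tree; leaves store the mean duration."""
    split = best_split(rows, targets)
    if split is None:
        return {"leaf": mean(targets)}
    feature, value = split
    yes = [(r, t) for r, t in zip(rows, targets) if r[feature] == value]
    no = [(r, t) for r, t in zip(rows, targets) if r[feature] != value]
    return {
        "test": split,
        "yes": build_tree([r for r, _ in yes], [t for _, t in yes]),
        "no": build_tree([r for r, _ in no], [t for _, t in no]),
    }

def predict(tree, row):
    """Follow the equality tests down to a leaf and return its mean duration."""
    while "leaf" not in tree:
        feature, value = tree["test"]
        tree = tree["yes"] if row[feature] == value else tree["no"]
    return tree["leaf"]

# Hypothetical toy data: context factors and made-up durations in milliseconds.
rows = [
    {"stressed": True, "position": "final", "next_voiced": True},
    {"stressed": True, "position": "medial", "next_voiced": False},
    {"stressed": False, "position": "final", "next_voiced": True},
    {"stressed": False, "position": "medial", "next_voiced": False},
    {"stressed": True, "position": "final", "next_voiced": False},
    {"stressed": False, "position": "medial", "next_voiced": True},
]
durations_ms = [140.0, 100.0, 90.0, 50.0, 130.0, 60.0]

tree = build_tree(rows, durations_ms)
```

In this toy setting, predicted per-phone durations could then be accumulated along a word's dictionary pronunciation to place each phone in time, which is the kind of information a prior model of phonetic timing would need. A practical tree would also need a stopping rule (for example, a minimum leaf size) to avoid fitting every training example exactly.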
doi: 10.21437/Interspeech.2013-337

Cite as: Kintzley, K., Jansen, A., Hermansky, H. (2013) Text-to-speech inspired duration modeling for improved whole-word acoustic models. Proc. Interspeech 2013, 1253-1257, doi: 10.21437/Interspeech.2013-337

@inproceedings{kintzley13_interspeech,
  author={Keith Kintzley and Aren Jansen and Hynek Hermansky},
  title={{Text-to-speech inspired duration modeling for improved whole-word acoustic models}},
  year=2013,
  booktitle={Proc. Interspeech 2013},
  pages={1253--1257},
  doi={10.21437/Interspeech.2013-337}
}