The construction of high-performance acoustic models for certain speech recognition tasks is very costly and time-consuming, since it most often requires the collection and transcription of large amounts of task-specific speech data. In this paper, acoustic modeling for spoken dialogue systems based on unsupervised selective training is examined. The main idea is to select training utterances from an (untranscribed) speech data pool so that the likelihood of a separate small (transcribed) development speech data set is maximized. If only the selected data are employed to retrain the initial acoustic models, better performance is achieved than when retraining with all collected data. With the proposed approach it is also possible to considerably reduce the cost of human labeling of the speech data without compromising performance. Furthermore, the method provides a means for automatic task-adaptation of acoustic models, e.g. to adult or children's speech. This is important, since detailed information about each automatically collected utterance is usually not available.
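The abstract does not spell out the selection procedure itself; the following is only a minimal sketch, in Python, of one naive greedy reading of utterance-based selective training: pool utterances are added one at a time whenever retraining on the enlarged subset increases the log-likelihood of the transcribed development set. All names here (select_utterances, retrain_acoustic_model, dev_set_log_likelihood) are hypothetical placeholders, not from the paper, and no efficiency measures the authors may use are reflected.

# Hypothetical greedy sketch of likelihood-based utterance selection.
# The caller supplies the two model-dependent operations as callables.
def select_utterances(pool, dev_set, initial_model,
                      retrain_acoustic_model, dev_set_log_likelihood):
    # pool          : list of untranscribed utterances (feature sequences)
    # dev_set       : small transcribed development set
    # initial_model : acoustic model trained on generic data
    selected = []
    model = initial_model
    best_ll = dev_set_log_likelihood(model, dev_set)

    remaining = list(pool)
    improved = True
    while improved and remaining:
        improved = False
        best_utt = None
        for utt in remaining:
            # Tentatively retrain with this utterance added and score
            # the resulting model on the development set.
            candidate = retrain_acoustic_model(model, selected + [utt])
            ll = dev_set_log_likelihood(candidate, dev_set)
            if ll > best_ll:
                best_ll, best_utt, best_model = ll, utt, candidate
                improved = True
        if improved:
            # Keep the single utterance that helped most in this pass.
            selected.append(best_utt)
            remaining.remove(best_utt)
            model = best_model
    return selected, model

As written, this re-estimates a model per candidate utterance and is therefore far too expensive for realistic data pools; it is meant only to make the selection criterion (development-set likelihood) concrete.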
Cite as: Cincarek, T., Toda, T., Saruwatari, H., Shikano, K. (2006) Acoustic modeling for spoken dialogue systems based on unsupervised utterance-based selective training. Proc. Interspeech 2006, paper 1481-Wed2A2O.2, doi: 10.21437/Interspeech.2006-478
@inproceedings{cincarek06_interspeech,
  author={Tobias Cincarek and Tomoki Toda and Hiroshi Saruwatari and Kiyohiro Shikano},
  title={{Acoustic modeling for spoken dialogue systems based on unsupervised utterance-based selective training}},
  year=2006,
  booktitle={Proc. Interspeech 2006},
  pages={paper 1481-Wed2A2O.2},
  doi={10.21437/Interspeech.2006-478}
}