12th Annual Conference of the International Speech Communication Association

Florence, Italy
August 27-31, 2011

Reliability-Weighted Acoustic Model Adaptation Using Crowd Sourced Transcriptions

Kartik Audhkhasi, Panayiotis G. Georgiou, Shrikanth Narayanan

University of Southern California, USA

This paper focuses on adapting acoustic models using speech transcribed by multiple noisy experts. A simple approach combines the transcripts using word-frequency-based Recognizer Output Voting Error Reduction (ROVER) and then adapts the models using the combined transcripts. However, this assumes that the transcripts being combined are equally reliable. To relax this assumption, we use two sets of scores to estimate the reliability of each transcript. The first set is based on the transcribers' answers to some questions. The second set is derived in an unsupervised way from the word-frequency-based ROVER transcripts and the baseline acoustic models. The overall confidence is a convex combination of these scores and is used to perform a confidence-weighted fusion of the transcripts. We adapt the baseline acoustic models using these combined transcripts. Recognition results for a Mexican Spanish ASR system show absolute improvements of 0.5% in word error rate and 0.9% in sentence error rate.
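
The fusion idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the transcripts are already aligned word by word (full ROVER builds this alignment itself), that each transcriber has one question-based score and one unsupervised score in [0, 1], and that the mixing weight `alpha` is chosen by hand. The function names, scores, and example sentences are all hypothetical.

```python
from collections import defaultdict


def combine_confidence(q_score, u_score, alpha=0.5):
    """Convex combination of two reliability scores (alpha is an assumed weight)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * q_score + (1.0 - alpha) * u_score


def weighted_vote(aligned_transcripts, confidences):
    """At each aligned slot, keep the word with the largest summed confidence."""
    if len(aligned_transcripts) != len(confidences):
        raise ValueError("need exactly one confidence per transcript")
    fused = []
    for slot in zip(*aligned_transcripts):
        totals = defaultdict(float)
        for word, conf in zip(slot, confidences):
            totals[word] += conf
        fused.append(max(totals, key=totals.get))
    return fused


# Hypothetical, pre-aligned transcripts of one utterance from three transcribers.
transcripts = [
    ["el", "gato", "come"],
    ["el", "pato", "come"],
    ["el", "pato", "comen"],
]
question_scores = [0.9, 0.4, 0.3]      # made-up question-based scores
unsupervised_scores = [0.8, 0.5, 0.2]  # made-up unsupervised scores

confidences = [
    combine_confidence(q, u, alpha=0.5)
    for q, u in zip(question_scores, unsupervised_scores)
]

fused = weighted_vote(transcripts, confidences)
unweighted = weighted_vote(transcripts, [1.0, 1.0, 1.0])  # plain frequency voting

print(fused)       # ['el', 'gato', 'come']
print(unweighted)  # ['el', 'pato', 'come']
```

With equal weights the majority word "pato" wins the second slot, while the confidence-weighted vote follows the single transcriber judged most reliable. This is the difference between word-frequency-based and confidence-weighted fusion that the abstract describes.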

Bibliographic reference. Audhkhasi, Kartik / Georgiou, Panayiotis G. / Narayanan, Shrikanth (2011): "Reliability-weighted acoustic model adaptation using crowd sourced transcriptions", in INTERSPEECH-2011, 3045-3048.