This paper addresses the adaptation of acoustic models using speech transcribed by multiple noisy experts. A simple approach combines the transcripts using word-frequency-based Recognizer Output Voting Error Reduction (ROVER) and then adapts the models on the combined transcripts. This approach, however, assumes that all transcripts being combined are equally reliable. To relax this assumption, we estimate the reliability of each transcript from two sets of scores. The first set is based on the transcribers' answers to a set of questions. The second set is derived in an unsupervised way from the word-frequency-based ROVER transcripts and the baseline acoustic models. The overall confidence is a convex combination of these two scores and is used to perform a confidence-weighted fusion of the transcripts. We then adapt the baseline acoustic models using the fused transcripts. Recognition results for a Mexican Spanish ASR system show absolute improvements of 0.5% in word error rate and 0.9% in sentence error rate.
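As a rough illustration of the confidence-weighted fusion described above, the following minimal Python sketch forms a convex combination of two per-transcriber scores and uses it to weight a word-level vote. It is a sketch under stated assumptions, not the authors' implementation: the function names, the mixing weight `alpha`, the score values, and the example transcripts are all hypothetical, and the transcripts are assumed to be already aligned word by word (ROVER itself performs that alignment, which is omitted here).

```python
from collections import defaultdict


def combined_confidence(self_score, unsup_score, alpha=0.5):
    """Convex combination of two reliability scores.

    alpha is a hypothetical mixing weight in [0, 1]; the abstract does
    not state how the weight is chosen.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * self_score + (1.0 - alpha) * unsup_score


def weighted_vote(aligned_transcripts, confidences):
    """For each aligned slot, keep the word with the largest total confidence.

    aligned_transcripts: equal-length word lists, one per transcriber,
    assumed to be pre-aligned; an empty string marks a deletion.
    confidences: one confidence value per transcriber.
    """
    fused = []
    for slot in zip(*aligned_transcripts):
        totals = defaultdict(float)
        for word, conf in zip(slot, confidences):
            totals[word] += conf
        best = max(totals, key=totals.get)
        if best != "":  # skip slots where a deletion wins the vote
            fused.append(best)
    return fused


# Hypothetical example: three transcribers, four aligned word slots.
transcripts = [
    ["la", "casa", "es", "roja"],
    ["la", "cosa", "es", "roja"],
    ["la", "cosa", "es", "roca"],
]
question_scores = [0.9, 0.4, 0.3]      # assumed scores from transcriber answers
unsupervised_scores = [0.8, 0.3, 0.2]  # assumed scores from the unsupervised estimate

confidences = [
    combined_confidence(q, u, alpha=0.5)
    for q, u in zip(question_scores, unsupervised_scores)
]
fused = weighted_vote(transcripts, confidences)
print(fused)  # ['la', 'casa', 'es', 'roja']
```

In this toy case a plain word-frequency vote would pick "cosa" in the second slot (two votes against one), whereas the confidence-weighted vote keeps "casa" because the first transcriber's combined confidence outweighs the other two together.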
Bibliographic reference. Audhkhasi, Kartik / Georgiou, Panayiotis G. / Narayanan, Shrikanth (2011): "Reliability-weighted acoustic model adaptation using crowd sourced transcriptions", In INTERSPEECH-2011, 3045-3048.