11th Annual Conference of the International Speech Communication Association

Makuhari, Chiba, Japan
September 26-30, 2010

The Use of Sense in Unsupervised Training of Acoustic Models for ASR Systems

Rita Singh, Benjamin Lambert, Bhiksha Raj

Carnegie Mellon University, USA

In unsupervised training of ASR systems, no annotated data are assumed to exist. Word-level annotations for the training audio are generated iteratively using an ASR system. At each iteration, the subset of data judged to have the most reliable transcriptions is selected to train the next set of acoustic models. Data selection, however, remains a difficult problem, particularly when the error rate of the recognizer providing the initial annotation is very high. In this paper we propose an iterative algorithm that uses a combination of likelihoods and a simple model of sense to select data. We show that the algorithm is effective for unsupervised training of acoustic models, particularly when the initial annotation is highly erroneous. Experiments conducted on Fisher-1 data, using initial models from Switchboard and a vocabulary and LM derived from the Google N-grams, show that performance on a selected held-out test set from the Fisher data improves when we use the proposed iterative approach.
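
The decode, select, retrain loop described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the sense model or how it is combined with likelihoods, so `sense_score` (fraction of adjacent word pairs found in a set of related pairs), the additive combination with `sense_weight`, the `keep_fraction` cutoff, and the `decode` and `train` callables are all hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Set, Tuple


@dataclass
class Hypothesis:
    utt_id: str
    words: List[str]
    avg_log_likelihood: float  # assumed: normalized decoder log-likelihood


def sense_score(words: Sequence[str], related: Set[Tuple[str, str]]) -> float:
    """Toy stand-in for a sense model (an assumption, not the paper's model):
    the fraction of adjacent word pairs that appear in `related`."""
    pairs = list(zip(words, words[1:]))
    if not pairs:
        return 0.0
    return sum(1 for p in pairs if p in related) / len(pairs)


def select_reliable(
    hyps: Sequence[Hypothesis],
    related: Set[Tuple[str, str]],
    sense_weight: float = 1.0,
    keep_fraction: float = 0.5,
) -> List[Hypothesis]:
    """Rank hypotheses by likelihood plus weighted sense score and keep
    the top `keep_fraction` of them (at least one, if any exist)."""
    ranked = sorted(
        hyps,
        key=lambda h: h.avg_log_likelihood
        + sense_weight * sense_score(h.words, related),
        reverse=True,
    )
    n_keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:n_keep]


def unsupervised_training(
    audio,
    model,
    decode: Callable,  # hypothetical: decode(model, audio) -> list of Hypothesis
    train: Callable,   # hypothetical: train(selected hypotheses) -> new model
    related: Set[Tuple[str, str]],
    iterations: int = 3,
):
    """Iterate: transcribe with the current model, select the most
    reliable transcriptions, and train the next model on that subset."""
    for _ in range(iterations):
        hyps = decode(model, audio)
        selected = select_reliable(hyps, related)
        model = train(selected)
    return model
```

In this sketch the sense term can promote a sensible transcription over one with a slightly higher likelihood, which is the kind of effect the abstract attributes to combining the two sources of evidence.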

Bibliographic reference. Singh, Rita / Lambert, Benjamin / Raj, Bhiksha (2010): "The use of sense in unsupervised training of acoustic models for ASR systems", in INTERSPEECH-2010, 2938-2941.