Interspeech'2005 - Eurospeech
While large amounts of manually transcribed acoustic training data are available for well-known large-vocabulary speech recognition tasks, such as the transcription of broadcast news and Switchboard conversations, significantly less is available for several large spoken collections such as the MALACH corpus (in multiple languages), meeting recordings, conference presentations, and call center conversations. However, these collections offer vast quantities of untranscribed spontaneous speech that can be used to improve recognition accuracy. Several narrow-band and broadband speech collections are currently available, and carefully tuned speech recognition systems trained on several hundred hours of manually transcribed data are now able to achieve word error rates between 10% and 40%, depending on the difficulty of the collection. This paper studies the use of automatically recognized transcriptions, at several levels of recognition accuracy, to train acoustic models, and the performance improvements obtained with such unsupervised training. It also proposes a recipe for selecting feature vectors at the utterance, word, or fragment level for training acoustic models that provides the maximum gain in recognition accuracy. The paper demonstrates that a relative reduction in overall word error rate of up to 20% can be obtained with careful selection of acoustic training data.
Bibliographic reference. Ramabhadran, Bhuvana (2005): "Exploiting large quantities of spontaneous speech for unsupervised training of acoustic models", In INTERSPEECH-2005, 1617-1620.