5th International Conference on Spoken Language Processing
This paper presents two linguistic techniques to improve broadcast news transcription. The first adapts the language model so that it reflects current news content. The model is trained on a weighted mixture of long-term news scripts and the latest scripts. The mixture weights are estimated by the EM algorithm for linear interpolation and then normalized by the text sizes. Both the n-grams and the vocabulary are updated from the latest news. We call this the Time Dependent Language Model (TDLM). It achieved a 4.4% reduction in perplexity and a 0.7% improvement in word accuracy over the baseline language model. The second technique corrects the decoded transcriptions using their corresponding electronic draft scripts. The corresponding drafts are found with a sentence similarity measure between the transcriptions and the drafts. Parts judged to be recognition errors are replaced with the text of the original drafts. This post-correction led to a 6.7% improvement in word accuracy.
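As an illustration of the weight estimation mentioned in the abstract, the following is a minimal sketch of EM for linear-interpolation weights. It assumes that the probability each component model assigns to every held-out word is already available. The function name, the input layout, and the iteration count are assumptions made for this example, not details from the paper. The later normalization by text size is not shown because the abstract does not specify it.

```python
def em_interpolation_weights(component_probs, iterations=50):
    """Estimate linear-interpolation weights by EM (illustrative sketch).

    component_probs[t][k] is the probability that component model k
    (e.g. long-term scripts vs. latest scripts) assigns to held-out word t.
    This input layout is an assumption of the example.
    """
    num_models = len(component_probs[0])
    # Start from uniform weights.
    weights = [1.0 / num_models] * num_models

    for _ in range(iterations):
        expected = [0.0] * num_models
        for probs in component_probs:
            # Probability of this word under the current mixture.
            mix = sum(w * p for w, p in zip(weights, probs))
            if mix == 0.0:
                continue  # no component explains this word; skip it
            # E-step: posterior responsibility of each component.
            for k in range(num_models):
                expected[k] += weights[k] * probs[k] / mix
        total = sum(expected)
        if total == 0.0:
            break  # nothing to re-estimate from
        # M-step: new weights are the normalized expected counts.
        weights = [count / total for count in expected]
    return weights
```

Each iteration cannot decrease the held-out likelihood of the mixture, so the returned weights favor whichever component predicts the held-out text better.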
Bibliographic reference. Kobayashi, Akio / Onoe, Kazuo / Imai, Toru / Ando, Akio (1998): "Time dependent language model for broadcast news transcription and its post-correction", In ICSLP-1998, paper 0973.