5th International Conference on Spoken Language Processing
Large vocabulary automatic speech recognition might assist hearing-impaired telephone users by displaying a transcription of the incoming side of the conversation, but the system would have to achieve sufficient accuracy on conversational-style, telephone-bandwidth speech. We describe our development work toward such a system. This work comprised three phases: experiments with clean data filtered to 200-3500 Hz, experiments with real telephone data, and language model development. In the first phase, the speaker-independent error rate was reduced from 25% to 12% by using MLLT, increasing the number of cepstral components from 9 to 13, and increasing the number of Gaussians from 30,000 to 120,000. The resulting system, however, performed less well on actual telephone speech, producing an error rate of 28.4%. With additional adaptation and the use of an LDA and CDCN combination, the error rate was reduced to 19.1%. Speaker adaptation reduced the error rate to 10.96%. These results were obtained with read speech. To explore the language-model requirements in a more realistic situation, we collected some conversational speech with an arrangement in which one participant could not hear the conversation but only saw recognizer output on a screen. We found that a mixture of language models, one derived from the Switchboard corpus and the other from prepared texts, resulted in approximately 10% fewer errors than either model alone.
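The abstract does not say how the two language models were combined. One common way to form such a mixture is linear interpolation of the component probabilities, and the sketch below illustrates that idea only. It is not the authors' implementation: the unigram models, the add-one smoothing, the toy sentences, and the weight of 0.5 are all assumptions made for illustration, and the names `train_unigram`, `interpolate`, and `perplexity` are hypothetical.

```python
import math
from collections import Counter


def train_unigram(tokens, vocab):
    """Add-one smoothed unigram probabilities over a fixed vocabulary."""
    counts = Counter(tokens)
    total = len(tokens) + len(vocab)
    return {w: (counts[w] + 1) / total for w in vocab}


def interpolate(p_a, p_b, weight_a):
    """Linear mixture: weight_a * p_a(w) + (1 - weight_a) * p_b(w)."""
    return {w: weight_a * p_a[w] + (1.0 - weight_a) * p_b[w] for w in p_a}


def perplexity(model, tokens):
    """Perplexity of a unigram model on a token sequence."""
    log_sum = sum(math.log(model[w]) for w in tokens)
    return math.exp(-log_sum / len(tokens))


# Toy stand-ins (assumed data) for a conversational corpus and prepared texts.
conversational = "yeah i think so yeah well i guess so".split()
prepared = "the report describes the system and the results".split()
held_out = "yeah i think the system results".split()

vocab = sorted(set(conversational) | set(prepared) | set(held_out))
p_conv = train_unigram(conversational, vocab)
p_prep = train_unigram(prepared, vocab)
p_mix = interpolate(p_conv, p_prep, weight_a=0.5)  # weight is an assumption

for name, model in [("conversational", p_conv),
                    ("prepared", p_prep),
                    ("mixture", p_mix)]:
    print(f"{name}: perplexity = {perplexity(model, held_out):.2f}")
```

In practice the interpolation weight would be tuned on held-out data rather than fixed, and the components would be higher-order n-gram models rather than unigrams.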
Bibliographic reference. Jan, Ea-Ee / Bakis, Raimo / Liu, Fu-Hua / Picheny, Michael (1998): "Telephone band LVCSR for hearing-impaired users", In ICSLP-1998, paper 0862.