Large vocabulary automatic speech recognition might assist hearing-impaired telephone users by displaying a transcription of the incoming side of the conversation, but the system would have to achieve sufficient accuracy on conversational-style, telephone-bandwidth speech. We describe our development work toward such a system. This work comprised three phases: experiments with clean data filtered to 200-3500 Hz, experiments with real telephone data, and language model development. In the first phase, the speaker-independent error rate was reduced from 25% to 12% by using MLLT, increasing the number of cepstral components from 9 to 13, and increasing the number of Gaussians from 30,000 to 120,000. The resulting system, however, performed less well on actual telephony, producing an error rate of 28.4%. By additional adaptation and the use of an LDA and CDCN combination, the error rate was reduced to 19.1%. Speaker adaptation further reduced the error rate to 10.96%. These results were obtained with read speech. To explore the language-model requirements in a more realistic situation, we collected some conversational speech with an arrangement in which one participant could not hear the conversation but only saw recognizer output on a screen. We found that a mixture of language models, one derived from the Switchboard corpus and the other from prepared texts, resulted in approximately 10% fewer errors than either model alone.
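The mixture of language models described above is most naturally read as a linear interpolation of the two component models. A minimal sketch of that formulation follows; the interpolation weight lambda and the idea of tuning it on held-out data are assumptions for illustration, not details given in the abstract:

% Linear interpolation of the Switchboard-derived and prepared-text
% language models (sketch; the weight \lambda is an assumption, not
% stated in the paper, and would typically be tuned on held-out data).
P_{\mathrm{mix}}(w \mid h) \;=\; \lambda \, P_{\mathrm{SWB}}(w \mid h) \;+\; (1 - \lambda) \, P_{\mathrm{text}}(w \mid h), \qquad 0 \le \lambda \le 1

Here w is the predicted word and h its history; setting lambda to 0 or 1 recovers either component model alone, consistent with the comparison reported in the abstract.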
Cite as: Jan, E.-E., Bakis, R., Liu, F.-H., Picheny, M. (1998) Telephone band LVCSR for hearing-impaired users. Proc. 5th International Conference on Spoken Language Processing (ICSLP 1998), paper 0862, doi: 10.21437/ICSLP.1998-680
@inproceedings{jan98_icslp,
  author    = {Ea-Ee Jan and Raimo Bakis and Fu-Hua Liu and Michael Picheny},
  title     = {{Telephone band LVCSR for hearing-impaired users}},
  year      = {1998},
  booktitle = {Proc. 5th International Conference on Spoken Language Processing (ICSLP 1998)},
  pages     = {paper 0862},
  doi       = {10.21437/ICSLP.1998-680}
}