Over the past several years, speech recognition research has focused primarily on speech transmitted over telephone or IP networks, and IP telephony has recently come into widespread use. This paper describes the performance of a speech recognizer on noisy speech transmitted over an H.323 IP telephony network, where the minimum mean-square error log-spectral amplitude (MMSE-LSA) method [1,2] is used to reduce the mismatch between training and deployment conditions in order to achieve robust speech recognition. In the H.323 network environment, the sources of distortion to the speech are packet loss and additive noise. In this work, we first evaluate the impact of packet loss on speech recognition performance and then explore the effects of uncorrelated additive acoustic noise, using seven types of noise sources in our experiments. The experimental results indicate that MMSE-LSA enhancement improves robustness for some types of additive noise at certain packet loss rates over the H.323 telephone network.
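The two distortion sources named above can be illustrated with a minimal Python sketch. It is not the authors' experimental setup: the 8 kHz sampling rate, the 160-sample (20 ms) packets, the independent random loss model, the replacement of lost packets with silence, and the function names `add_noise_at_snr` and `drop_packets` are all assumptions made for illustration. Noise is added before packet loss because the abstract describes noisy speech being transmitted over the network.

```python
import numpy as np

def add_noise_at_snr(speech, noise, snr_db):
    """Scale `noise` so the mixture has the requested SNR (dB), then add it."""
    noise = np.resize(noise, speech.shape)  # repeat or truncate to match length
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    gain = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise

def drop_packets(signal, packet_len, loss_rate, rng):
    """Zero out whole packets, each lost independently with `loss_rate`.

    Assumption: lost packets are replaced by silence; a real H.323
    endpoint may use a different concealment strategy.
    """
    out = signal.copy()
    n_packets = int(np.ceil(len(signal) / packet_len))
    lost = rng.random(n_packets) < loss_rate
    for i in np.flatnonzero(lost):
        out[i * packet_len:(i + 1) * packet_len] = 0.0
    return out, lost

rng = np.random.default_rng(0)
fs = 8000                                # assumed telephone-band sampling rate
t = np.arange(fs) / fs
speech = np.sin(2 * np.pi * 440 * t)     # synthetic stand-in for a speech signal
noise = rng.standard_normal(fs)          # white noise as one example noise type

noisy = add_noise_at_snr(speech, noise, snr_db=10.0)
degraded, lost = drop_packets(noisy, packet_len=160, loss_rate=0.1, rng=rng)
```

Sweeping `snr_db` and `loss_rate` over a grid would give the kind of condition matrix the abstract describes, with an enhancement front end applied to `degraded` before recognition.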
Cite as: Chen, G., Tolba, H., O’Shaughnessy, D. (2006) Noise-robust speech recognition of conversational telephone speech. Proc. Interspeech 2006, paper 1304-Tue2BuP.3, doi: 10.21437/Interspeech.2006-338
@inproceedings{chen06b_interspeech, author={Gang Chen and Hesham Tolba and Douglas O’Shaughnessy}, title={{Noise-robust speech recognition of conversational telephone speech}}, year=2006, booktitle={Proc. Interspeech 2006}, pages={paper 1304-Tue2BuP.3}, doi={10.21437/Interspeech.2006-338} }