In real-world adverse environments, automatic speech recognition (ASR) performance is severely degraded by background noise corrupting the speech signal, by microphone channel variations, and by the speech production adjustments that speakers make in an effort to communicate efficiently over noise (Lombard effect). Recently, a set of unsupervised techniques reducing ASR sensitivity to these sources of distortion has been presented, with the main focus on equalization of Lombard effect (LE). The algorithms performing maximum-likelihood spectral transformation, cepstral dynamics normalization, and decoding with a codebook of noisy speech models have been shown to outperform conventional methods, but at the cost of a considerable increase in computational complexity, because they require numerous decoding passes through the ASR models. In this study, a scheme utilizing a set of speech-in-noise Gaussian mixture models and a neutral/LE classifier is shown to substantially decrease the computational load (from 14 to 2–4 ASR decoding passes) while preserving overall system performance. In addition, an extended codebook capturing multiple environmental noises is introduced and shown to improve ASR in changing environments (8.2–49.2% absolute reduction in word error rate, WER). The evaluation is performed on the Czech Lombard Speech Database (CLSD’05). The task is to recognize neutral/LE connected digit strings presented at different levels of background car noise and in Aurora 2 noises.
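The abstract does not spell out how the speech-in-noise Gaussian mixture models (GMMs) and the neutral/LE classifier cut the number of decoding passes. One plausible reading is that cheap GMM scoring shortlists a few codebook entries before any ASR decoding is run. The Python sketch below illustrates only that idea and is not the authors' implementation: the function names (`diag_gmm_loglik`, `select_models`), the `shortlist` parameter, the diagonal-covariance GMMs, and the use of a second pair of GMMs as the neutral/LE classifier are all assumptions made for illustration.

```python
import numpy as np


def diag_gmm_loglik(frames, weights, means, variances):
    """Average per-frame log-likelihood of frames (T, D) under a
    diagonal-covariance GMM with K components.

    weights: (K,), means: (K, D), variances: (K, D).
    """
    diff = frames[:, None, :] - means[None, :, :]                      # (T, K, D)
    log_norm = -0.5 * np.sum(np.log(2.0 * np.pi * variances), axis=1)  # (K,)
    log_comp = log_norm[None, :] - 0.5 * np.sum(
        diff ** 2 / variances[None, :, :], axis=2
    )                                                                  # (T, K)
    log_weighted = log_comp + np.log(weights)[None, :]
    # Log-sum-exp over components, stabilized by subtracting the row maximum.
    row_max = log_weighted.max(axis=1, keepdims=True)
    frame_ll = row_max[:, 0] + np.log(np.exp(log_weighted - row_max).sum(axis=1))
    return float(frame_ll.mean())


def select_models(frames, noise_gmms, style_gmms, shortlist=2):
    """Pick which (noise condition, speaking style) ASR models to decode with.

    noise_gmms and style_gmms map a label to a (weights, means, variances)
    tuple. Both dictionaries and the shortlist size are hypothetical.
    """
    noise_scores = {
        name: diag_gmm_loglik(frames, *gmm) for name, gmm in noise_gmms.items()
    }
    # Keep only the best-matching noise conditions instead of decoding with all.
    ranked = sorted(noise_scores, key=noise_scores.get, reverse=True)[:shortlist]
    # Assumed classifier: the style whose GMM explains the frames best.
    style = max(style_gmms, key=lambda s: diag_gmm_loglik(frames, *style_gmms[s]))
    return [(noise, style) for noise in ranked]
```

Under these assumptions, each returned pair would correspond to one ASR decoding pass, so the number of passes is set by `shortlist` rather than by the size of the codebook.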
Bibliographic reference. Bořil, Hynek / Hansen, John H. L. (2009): "Reduced complexity equalization of Lombard effect for speech recognition in noisy adverse environments", In INTERSPEECH-2009, 1243-1246.