EUROSPEECH 2001 Scandinavia
We show how an iterative form of Laplace's method can be used to estimate the log-spectrum of clean speech from the log-spectrum of noisy, distorted speech, using a time-varying mixture model of the log-spectra of the clean speech, noise, channel distortion and noisy speech. We use this method, called ALGONQUIN, to denoise speech features and then feed these features into a large-vocabulary speech recognizer whose WER on the clean WSJ data is 4.9%. When 10 dB of time-varying airplane engine noise is added to the data, the recognizer obtains a WER of 28.8%. ALGONQUIN reduces the WER to 12.6%, well below the WER of 25.0% obtained by spectral subtraction, and close to the WER of 9.7% obtained by retraining the recognizer on training data corrupted by the same noise. If ALGONQUIN is used to denoise the noisy training data before the recognizer is retrained, the WER drops to 8.5%. For 10 dB of white noise, spectral subtraction reduces the WER from 55.1% to 33.8%, whereas ALGONQUIN reduces it to 14.2%. The recognizer trained on noisy data obtains a WER of 14.0%, whereas the recognizer trained on noisy data denoised by ALGONQUIN obtains a WER of 9.9%.
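To make the comparison between methods easier to read, the WERs quoted in the abstract can be converted into relative reductions against the unprocessed noisy baseline. The following is a minimal Python sketch of that arithmetic only. The dictionary keys and the function names `relative_reduction` and `summarize` are illustrative names chosen here and are not taken from the paper; the numbers are the ones stated in the abstract.

```python
# WERs (%) as quoted in the abstract, keyed by noise condition.
# The key names are illustrative labels, not terminology from the paper.
WER = {
    "airplane_10dB": {
        "noisy": 28.8,                 # no denoising, recognizer trained on clean data
        "spectral_subtraction": 25.0,
        "algonquin": 12.6,
        "retrained_noisy": 9.7,        # recognizer retrained on noisy data
        "retrained_denoised": 8.5,     # retrained on ALGONQUIN-denoised noisy data
    },
    "white_10dB": {
        "noisy": 55.1,
        "spectral_subtraction": 33.8,
        "algonquin": 14.2,
        "retrained_noisy": 14.0,
        "retrained_denoised": 9.9,
    },
}


def relative_reduction(baseline: float, improved: float) -> float:
    """Relative WER reduction in percent: 100 * (baseline - improved) / baseline."""
    if baseline <= 0:
        raise ValueError("baseline WER must be positive")
    return 100.0 * (baseline - improved) / baseline


def summarize(wer=WER):
    """Relative reduction of every system against the 'noisy' baseline, per condition."""
    summary = {}
    for condition, results in wer.items():
        baseline = results["noisy"]
        summary[condition] = {
            system: relative_reduction(baseline, value)
            for system, value in results.items()
            if system != "noisy"
        }
    return summary


if __name__ == "__main__":
    for condition, reductions in summarize().items():
        for system, reduction in reductions.items():
            print(f"{condition:14s} {system:22s} {reduction:5.1f}% relative reduction")
```

Under this measure, ALGONQUIN without retraining removes roughly 56% of the errors for airplane noise and roughly 74% for white noise, compared with roughly 13% and 39% for spectral subtraction.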
Bibliographic reference. Frey, Brendan J. / Deng, Li / Acero, Alex / Kristjansson, Trausti (2001): "ALGONQUIN: iterating Laplace's method to remove multiple types of acoustic distortion for robust speech recognition", in EUROSPEECH-2001, 901-904.