14th Annual Conference of the International Speech Communication Association

Lyon, France
August 25-29, 2013

Is Speech Enhancement Pre-Processing Still Relevant When Using Deep Neural Networks for Acoustic Modeling?

Marc Delcroix, Yotaro Kubo, Tomohiro Nakatani, Atsushi Nakamura

NTT Corporation, Japan

Deep neural networks (DNNs) have recently attracted much attention in automatic speech recognition (ASR) because of the large performance improvements they provide across a variety of tasks. DNNs are known to be robust to overfitting and able to remove speaker variability. Another important cause of variability in speech is the presence of noise. Much research has been undertaken on noise-robust ASR, including front-end and back-end approaches. However, most approaches have been developed or evaluated on traditional ASR systems based on Gaussian mixture models (GMMs). The question we address in this paper is whether conventional noise-robust approaches can still be competitive when using recent DNN-based ASR systems. To this end, we experimentally compare the performance of DNN-based ASR systems in a distant speech recognition task, for DNNs trained with noise-free, noisy, and enhanced speech. We confirm that DNNs are powerful when the training and testing conditions are well matched. However, performance degrades in the presence of noise. Using a speech enhancement pre-processor to reduce the noise variability significantly improves performance, with an improvement comparable to that observed with conventional GMM-based ASR systems.


Bibliographic reference. Delcroix, Marc / Kubo, Yotaro / Nakatani, Tomohiro / Nakamura, Atsushi (2013): "Is speech enhancement pre-processing still relevant when using deep neural networks for acoustic modeling?", in INTERSPEECH-2013, 2992-2996.