Machine Listening in Multisource Environments (CHiME) 2011

Florence, Italy
September 1, 2011

Speech Recognition in the Presence of Highly Non-Stationary Noise based on Spatial, Spectral and Temporal Speech/Noise Modeling Combined with Dynamic Variance Adaptation

Marc Delcroix, Keisuke Kinoshita, Tomohiro Nakatani, Shoko Araki, Atsunori Ogawa, Takaaki Hori, Shinji Watanabe, Masakiyo Fujimoto, Takuya Yoshioka, Takanobu Oba, Yotaro Kubo, Mehrez Souden, Seong-Jun Hahm, Atsushi Nakamura

NTT Communication Science Laboratories, NTT Corporation, Japan

In this paper, we introduce a system for recognizing speech in the presence of multiple rapidly time-varying noise sources. The main components of the proposed approach are a model-based speech enhancement pre-processor and an adaptation technique that optimizes the integration between the pre-processor and the recognizer. The speech enhancement pre-processor consists of two complementary elements: a multi-channel speech-noise separation method that exploits spatial and spectral information, followed by single-channel enhancement that uses the long-term temporal characteristics of speech. To compensate for any mismatch that may remain between the enhanced features and the acoustic model, we employ an adaptation technique that combines conventional MLLR with dynamic adaptive compensation of the variances of the acoustic model's Gaussians. The proposed system greatly improves the audible quality of the speech and substantially improves keyword recognition accuracy.

Index Terms. Robust ASR, Source separation, Model-based speech enhancement, Example-based enhancement, Model adaptation, Dynamic variance adaptation

Bibliographic reference.  Delcroix, Marc / Kinoshita, Keisuke / Nakatani, Tomohiro / Araki, Shoko / Ogawa, Atsunori / Hori, Takaaki / Watanabe, Shinji / Fujimoto, Masakiyo / Yoshioka, Takuya / Oba, Takanobu / Kubo, Yotaro / Souden, Mehrez / Hahm, Seong-Jun / Nakamura, Atsushi (2011): "Speech recognition in the presence of highly non-stationary noise based on spatial, spectral and temporal speech/noise modeling combined with dynamic variance adaptation", In CHiME-2011, 12-17.