Machine Listening in Multisource Environments (CHiME) 2011
Convolutive non-negative matrix factorization (CNMF) is an effective approach for supervised audio source separation. It relies on the availability of sufficient training data to learn a set of bases for each acoustic source. For automatic speech recognition (ASR) in a multi-source noise environment, the varied nature of background noise makes it challenging to learn the noise bases and, in turn, to suppress the noise in the speech signal using CNMF. A large amount of training data is required to capture noise variation reliably, but this generally leads to an unacceptable computational burden. Here, we address this problem by learning the noise bases with a computationally efficient, online CNMF approach. By learning the noise bases both from several hours of ambient noise data and from a few seconds of local acoustic context, we show that background noise can be effectively attenuated from noisy speech. ASR accuracies on the CHiME corpus with the denoised speech show relative improvements ranging from 42.3% at -6 dB signal-to-noise ratio (SNR) to 2.5% at 9 dB SNR.
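To make the supervised CNMF idea concrete, the following is a minimal sketch and not the paper's implementation. It assumes the common convolutive model V ≈ Σ_t W(t)·shift(H, t), a KL-divergence multiplicative update for the activations with pre-learned bases held fixed, and a soft-mask reconstruction of the speech part. All names (`shift`, `reconstruct`, `update_activations`), the synthetic random bases, and the matrix sizes are illustrative assumptions; the online basis-learning step described above is not shown.

```python
import numpy as np

EPS = 1e-12  # small constant to avoid division by zero (assumption)

def shift(X, t):
    """Shift columns of X right by t (left if t < 0), zero-filling."""
    out = np.zeros_like(X)
    if t == 0:
        out[:] = X
    elif t > 0:
        out[:, t:] = X[:, :-t]
    else:
        out[:, :t] = X[:, -t:]
    return out

def reconstruct(W, H):
    """Convolutive model: W has shape (T, F, K), H has shape (K, N)."""
    return sum(W[t] @ shift(H, t) for t in range(W.shape[0]))

def kl_div(V, L):
    """Generalized KL divergence between V and its approximation L."""
    return float(np.sum(V * np.log((V + EPS) / (L + EPS)) - V + L))

def update_activations(V, W, H, n_iter=50):
    """Multiplicative KL updates for H with the bases W held fixed."""
    ones = np.ones_like(V)
    T = W.shape[0]
    for _ in range(n_iter):
        ratio = V / (reconstruct(W, H) + EPS)
        num = sum(W[t].T @ shift(ratio, -t) for t in range(T))
        den = sum(W[t].T @ shift(ones, -t) for t in range(T))
        H = H * num / (den + EPS)
    return H

# Synthetic stand-ins for pre-learned speech and noise bases (hypothetical).
rng = np.random.default_rng(0)
T, F, K_s, K_n, N = 4, 20, 3, 2, 60
W_speech = rng.random((T, F, K_s))
W_noise = rng.random((T, F, K_n))
W = np.concatenate([W_speech, W_noise], axis=2)

# Synthetic "noisy speech" magnitude spectrogram built from the model itself.
V = reconstruct(W, rng.random((K_s + K_n, N)))

# Fit activations for the mixture, then split them by source.
H0 = rng.random((K_s + K_n, N))
H = update_activations(V, W, H0, n_iter=100)
speech_part = reconstruct(W_speech, H[:K_s])
noise_part = reconstruct(W_noise, H[K_s:])

# Soft mask applied to the mixture to attenuate the noise component.
mask = speech_part / (speech_part + noise_part + EPS)
speech_est = mask * V
```

In this sketch the noise is suppressed by a ratio mask built from the two partial reconstructions; other reconstruction choices are possible, and the abstract does not say which one the authors use.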
Index Terms. Convolutive non-negative matrix factorization, online CNMF, speech separation, automatic speech recognition
Bibliographic reference. Vipperla, Ravichander / Bozonnet, Simon / Wang, Dong / Evans, Nicholas (2011): "Robust speech recognition in multi-source noise environments using convolutive non-negative matrix factorization", In CHiME-2011, 74-79.