15th Annual Conference of the International Speech Communication Association

September 14-18, 2014

Post-Masking: A Hybrid Approach to Array Processing for Speech Recognition

Amir R. Moghimi, Bhiksha Raj, Richard M. Stern

Carnegie Mellon University, USA

In the context of array processing for speech and audio applications, linear beamforming has long been the approach of choice, for reasons including good performance, robustness and analytical simplicity. Nevertheless, various nonlinear techniques, typically based on the study of auditory scene analysis, have also been of interest. The class of techniques known as time-frequency (T-F) masking, in particular, shows promise; T-F masking is based on accepting or rejecting individual time-frequency cells based on some estimate of local signal quality. While these approaches have been shown to outperform linear beamforming in two-sensor arrays, extensions to larger arrays have been few and unsuccessful. This paper seeks to gain a deeper understanding of the limitations of T-F masking in larger arrays and to develop an approach to overcome them. It is shown that combining beamforming and masking can bring the benefits of masking to larger arrays. As a result, a hybrid beamforming-masking approach, called post-masking, is developed that improves upon the performance of MMSE beamforming (and can be used with any beamforming technique). Post-masking extends the benefits of masking up to arrays of six elements or more, with the potential for even greater improvement in the future.
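The two ideas in the abstract, accepting or rejecting individual T-F cells and applying a mask after beamforming, can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the quality estimate, the threshold, or the beamformer. The function names, the 0.5 threshold, and the use of a plain channel average in place of an MMSE beamformer are all assumptions made for illustration.

```python
# Illustrative sketch only: names, threshold, and the averaging beamformer
# are assumptions, not details taken from the paper.

def binary_mask(quality, threshold):
    """Return a 0/1 mask: 1.0 where the local quality estimate meets the threshold."""
    return [[1.0 if q >= threshold else 0.0 for q in frame] for frame in quality]


def beamform_average(channels):
    """Stand-in linear beamformer: average time-aligned channels cell by cell.

    channels[m][t][f] is the T-F coefficient of sensor m at frame t, bin f.
    """
    n_sensors = len(channels)
    n_frames = len(channels[0])
    n_bins = len(channels[0][0])
    return [
        [sum(ch[t][f] for ch in channels) / n_sensors for f in range(n_bins)]
        for t in range(n_frames)
    ]


def post_mask(channels, quality, threshold=0.5):
    """Beamform first, then accept or reject each T-F cell of the beamformer output."""
    beam = beamform_average(channels)
    mask = binary_mask(quality, threshold)
    return [
        [b * m for b, m in zip(beam_frame, mask_frame)]
        for beam_frame, mask_frame in zip(beam, mask)
    ]


# Toy input: 2 sensors, 2 frames, 2 frequency bins (complex T-F coefficients).
channels = [
    [[1 + 0j, 2 + 0j], [3 + 0j, 4 + 0j]],
    [[3 + 0j, 2 + 0j], [1 + 0j, 0 + 0j]],
]
# Hypothetical per-cell quality estimates in [0, 1].
quality = [[0.9, 0.2], [0.4, 0.7]]

masked = post_mask(channels, quality)
```

In this toy case the cells whose quality estimate falls below the threshold are zeroed in the beamformer output, while the remaining cells keep the averaged value. Any other linear beamformer could be substituted for `beamform_average`, which matches the abstract's remark that post-masking can be used with any beamforming technique.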


Bibliographic reference. Moghimi, Amir R. / Raj, Bhiksha / Stern, Richard M. (2014): "Post-masking: a hybrid approach to array processing for speech recognition", In INTERSPEECH-2014, 2425-2429.