In this paper, we propose a mask estimation method for a computational auditory scene analysis (CASA) based speech recognition front-end using speech obtained from two microphones. The proposed mask estimation method incorporates the observation that mask information should be correlated over contiguous analysis time frames and adjacent frequency channels. To this end, two different hidden Markov models (HMMs), a time HMM and a frequency HMM, representing the time and frequency trajectories respectively, are trained on features such as the interaural time difference and the interaural level difference of the two-channel signals. The mask for a given time-frequency bin is estimated by combining the likelihoods obtained from the two HMMs, and is then used to separate the desired speech from the noisy input. To show the effectiveness of the proposed mask estimation, we first measure the root mean square error between the ideal mask and the mask estimated by the proposed method. We then compare the performance of a speech recognition system using the proposed mask estimation method with that of systems using conventional methods. As a result, the proposed method provides average word error rate reductions of 63.2% and 3.1% compared with the Gaussian kernel-based and time HMM-based mask estimation methods, respectively.
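The abstract does not specify how the two HMM likelihoods are combined, so the following is only a minimal sketch under an assumed combination rule: per-bin speech and noise log-likelihoods from the time HMM and the frequency HMM are summed (i.e., the likelihoods are multiplied), and the log-likelihood ratio is mapped through a logistic function to yield a soft time-frequency mask. The function name and the combination rule are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def combine_hmm_likelihoods(ll_time_speech, ll_time_noise,
                            ll_freq_speech, ll_freq_noise):
    """Combine per-bin log-likelihoods from a time HMM and a frequency HMM
    into a soft time-frequency mask.

    Assumed (hypothetical) rule: sum the log-likelihoods of the two HMMs
    for each hypothesis, then squash the resulting log-likelihood ratio
    with a logistic function so the mask lies in (0, 1).
    """
    ll_speech = ll_time_speech + ll_freq_speech   # joint speech evidence
    ll_noise = ll_time_noise + ll_freq_noise      # joint noise evidence
    return 1.0 / (1.0 + np.exp(-(ll_speech - ll_noise)))

# Toy example: 3 analysis frames x 4 frequency channels of log-likelihoods.
rng = np.random.default_rng(0)
shape = (3, 4)
mask = combine_hmm_likelihoods(rng.normal(1.0, 0.5, shape),
                               rng.normal(-1.0, 0.5, shape),
                               rng.normal(1.0, 0.5, shape),
                               rng.normal(-1.0, 0.5, shape))
# The soft mask can then gate the noisy spectrogram: S_hat = mask * Y.
print(mask.shape)  # (3, 4), every entry strictly between 0 and 1
```

In practice the per-bin log-likelihoods would come from forward passes of the trained time and frequency HMMs over ITD/ILD feature trajectories; here they are random placeholders.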
Bibliographic reference. Park, Ji Hun / Yoon, Jae Sam / Kim, Hong Kook (2008): "Mask estimation incorporating time-frequency trajectories for a CASA-based ASR front-end", in Proc. INTERSPEECH 2008, pp. 988-991.