INTERSPEECH 2009

The ideal binary mask, often used in robust speech recognition applications, requires an estimate of the local SNR in each time-frequency (T-F) unit. A data-driven approach is proposed for estimating the instantaneous SNR of each T-F unit. By assuming that the a priori SNR and a posteriori SNR are uniformly distributed within a small region, the instantaneous SNR is estimated by minimizing the localized Bayes risk. The binary mask estimator derived by the proposed approach is evaluated in terms of hit and false alarm rates. Compared to the binary mask estimator that uses the decision-directed approach to compute the SNR, the proposed data-driven approach yielded substantial improvements (up to 40%) in classification performance, when assessed in terms of a sensitivity metric based on the difference between the hit and false alarm rates.
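As an illustration of the quantities named in the abstract, the following minimal sketch builds an ideal binary mask from per-unit speech and noise powers and scores an estimated mask by its hit rate, false alarm rate, and their difference. It is not the paper's estimator: the localized Bayes risk minimization is not implemented here. The function names, the 0 dB local SNR threshold, and the toy power values are all assumptions made for the example.

```python
import math

EPS = 1e-12  # assumed floor that avoids log10(0) and division by zero


def ideal_binary_mask(speech_power, noise_power, threshold_db=0.0):
    """Label a T-F unit 1 if its local SNR in dB exceeds threshold_db, else 0.

    The 0 dB default is an assumption; the abstract does not state a threshold.
    """
    mask = []
    for s_row, n_row in zip(speech_power, noise_power):
        row = []
        for s, n in zip(s_row, n_row):
            snr_db = 10.0 * math.log10(max(s, EPS) / max(n, EPS))
            row.append(1 if snr_db > threshold_db else 0)
        mask.append(row)
    return mask


def hit_and_false_alarm(ideal, estimated):
    """Return (hit rate, false alarm rate, hit rate minus false alarm rate)."""
    hits = false_alarms = targets = non_targets = 0
    for i_row, e_row in zip(ideal, estimated):
        for i, e in zip(i_row, e_row):
            if i == 1:
                targets += 1
                hits += e          # estimated 1 where the ideal mask is 1
            else:
                non_targets += 1
                false_alarms += e  # estimated 1 where the ideal mask is 0
    hit_rate = hits / targets if targets else 0.0
    fa_rate = false_alarms / non_targets if non_targets else 0.0
    return hit_rate, fa_rate, hit_rate - fa_rate


# Toy 2 x 3 grid of T-F units (hypothetical values, unit noise power).
speech_power = [[4.0, 1.0, 0.5], [10.0, 0.1, 2.0]]
noise_power = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]

ideal = ideal_binary_mask(speech_power, noise_power)
estimated = [[1, 1, 0], [1, 0, 0]]  # hypothetical output of some mask estimator
hit_rate, fa_rate, sensitivity = hit_and_false_alarm(ideal, estimated)
```

In this toy case the estimated mask recovers two of the three speech-dominated units and wrongly keeps one of the three noise-dominated units, so the difference between the two rates is one third. Subtracting the false alarm rate penalizes an estimator that simply labels every unit as speech-dominated, which would otherwise achieve a perfect hit rate.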
Bibliographic reference. Kim, Gibak / Loizou, Philipos C. (2009): "A data-driven approach for estimating the time-frequency binary mask", In INTERSPEECH-2009, 844-847.