Ideal Ratio Mask Estimation Using Deep Neural Networks for Monaural Speech Segregation in Noisy Reverberant Conditions

Xu Li, Junfeng Li, Yonghong Yan


Monaural speech segregation is an important problem in robust speech processing and has been formulated as a supervised learning problem. In supervised learning methods, the ideal binary mask (IBM) is usually used as the target because of its simplicity and the large speech intelligibility gains it yields. Recently, the ideal ratio mask (IRM) has been found to improve speech quality over the IBM. However, the IRM was originally defined for anechoic conditions and does not account for the effect of reverberation. In this paper, the IRM is extended to reverberant conditions, where the direct sound and early reflections of the target speech are regarded as the desired signal. Deep neural networks (DNNs) are employed to estimate the extended IRM in noisy reverberant conditions. The estimated IRM is then applied to the noisy reverberant mixture for speech segregation. Experimental results show that the estimated IRM provides substantial improvements in speech intelligibility and speech quality over the unprocessed mixture signals under various noisy and reverberant conditions.
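
The idea of a ratio mask applied to a mixture can be illustrated with a minimal sketch. The abstract does not give the mask formula, so the code below assumes the square-root power-ratio form that is common in the IRM literature, and it assumes that everything other than the desired signal (late reverberation and noise) is lumped into a single interference term. The function names `ratio_mask` and `apply_mask`, the exponent `beta`, and the toy energy values are hypothetical and are not taken from the paper.

```python
import numpy as np


def ratio_mask(desired_power, interference_power, beta=0.5, eps=1e-12):
    """Ratio mask per time-frequency unit: (S / (S + N)) ** beta.

    desired_power      -- energy of the desired signal (assumed here to be
                          direct sound plus early reflections)
    interference_power -- energy of everything else (assumed here to be
                          late reverberation plus noise)
    beta               -- compression exponent; 0.5 is an assumption
    eps                -- guards against division by zero in silent units
    """
    desired_power = np.asarray(desired_power, dtype=float)
    interference_power = np.asarray(interference_power, dtype=float)
    return (desired_power / (desired_power + interference_power + eps)) ** beta


def apply_mask(mixture_tf, mask):
    """Weight each time-frequency unit of the mixture by the mask."""
    return mask * np.asarray(mixture_tf)


# Toy 2 x 3 time-frequency grid of energies (made-up numbers).
desired = np.array([[4.0, 1.0, 0.0],
                    [9.0, 0.0, 1.0]])
interference = np.array([[0.0, 1.0, 4.0],
                         [0.0, 0.0, 3.0]])

mask = ratio_mask(desired, interference)

# Stand-in for the mixture's time-frequency magnitudes.
mixture = np.sqrt(desired + interference)
segregated = apply_mask(mixture, mask)
```

In a supervised system of the kind described above, the mask would not be computed from the separated components at test time; a trained DNN would predict it from features of the mixture, and only the masking step would stay the same.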


DOI: 10.21437/Interspeech.2017-549

Cite as: Li, X., Li, J., Yan, Y. (2017) Ideal Ratio Mask Estimation Using Deep Neural Networks for Monaural Speech Segregation in Noisy Reverberant Conditions. Proc. Interspeech 2017, 1203-1207, DOI: 10.21437/Interspeech.2017-549.


@inproceedings{Li2017,
  author={Xu Li and Junfeng Li and Yonghong Yan},
  title={Ideal Ratio Mask Estimation Using Deep Neural Networks for Monaural Speech Segregation in Noisy Reverberant Conditions},
  year=2017,
  booktitle={Proc. Interspeech 2017},
  pages={1203--1207},
  doi={10.21437/Interspeech.2017-549},
  url={http://dx.doi.org/10.21437/Interspeech.2017-549}
}