In recent years, speech enhancement by analysis-resynthesis has emerged as an alternative to conventional noise filtering approaches. Analysis-resynthesis replaces noisy speech with a signal reconstructed from a clean speech model. It can deliver high-quality signals with no residual noise, but at the expense of losing information from the original signal that is not well represented by the model. A recent compromise, called constrained resynthesis, mitigates this loss by resynthesising only the spectro-temporal regions that are estimated to be masked by noise (conditioned on the evidence in the unmasked regions). In this paper we first extend the approach by: i) introducing multi-condition training and a deep discriminative model for the analysis stage; ii) introducing an improved resynthesis model that captures within-state cross-frequency dependencies. We then extend the previous stationary-noise evaluation by using real domestic audio noise from the CHiME-2 evaluation. We compare various mask estimation strategies while varying the degree of constraint by tuning the threshold for reliable speech detection. PESQ and log-spectral distance measures show that although mask estimation remains a challenge, only a few reliable signal regions need to be estimated in order to achieve performance close to that achieved with an optimal oracle mask.
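The oracle binary mask referred to in the abstract is conventionally obtained by thresholding the local SNR of each time-frequency cell; cells below the threshold are treated as noise-masked and become candidates for resynthesis. A minimal sketch of this thresholding (function name, threshold value, and toy spectrograms are illustrative assumptions, not taken from the paper):

```python
import numpy as np

def oracle_binary_mask(speech_power, noise_power, threshold_db=0.0):
    """Mark a time-frequency cell as 'reliable' (1) when its local SNR
    exceeds the threshold, else 'masked' (0). In constrained resynthesis,
    the masked cells would be imputed from the clean speech model.
    Illustrative sketch; not the paper's exact estimator."""
    snr_db = 10.0 * np.log10(speech_power / np.maximum(noise_power, 1e-12))
    return (snr_db > threshold_db).astype(int)

# Toy magnitude-squared spectrograms (freq x time); values are arbitrary.
rng = np.random.default_rng(0)
speech = rng.uniform(0.0, 1.0, size=(4, 5))
noise = np.full((4, 5), 0.25)

mask = oracle_binary_mask(speech, noise, threshold_db=0.0)
print(mask.shape)   # (4, 5)
print(mask.mean())  # fraction of cells deemed reliable
```

Raising `threshold_db` shrinks the set of reliable cells, i.e. tightens the constraint on resynthesis, which corresponds to the tuning of the reliable-speech-detection threshold described in the abstract.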
Cite as: Marxer, R., Barker, J. (2017) Binary Mask Estimation Strategies for Constrained Imputation-Based Speech Enhancement. Proc. Interspeech 2017, 1988-1992, doi: 10.21437/Interspeech.2017-1257
@inproceedings{marxer17_interspeech,
  author={Ricard Marxer and Jon Barker},
  title={{Binary Mask Estimation Strategies for Constrained Imputation-Based Speech Enhancement}},
  year=2017,
  booktitle={Proc. Interspeech 2017},
  pages={1988--1992},
  doi={10.21437/Interspeech.2017-1257}
}