Far-Field Speech Enhancement Using Heteroscedastic Autoencoder for Improved Speech Recognition

Shashi Kumar, Shakti P. Rath


Automatic speech recognition (ASR) systems trained on clean speech do not perform well in far-field scenarios. The degradation in word error rate (WER) can be as large as 40% in this mismatched scenario. Typically, speech enhancement is applied to map speech from the far-field condition to the clean condition using a neural network, commonly known as a denoising autoencoder (DA). Such speech enhancement techniques have shown significant improvement in ASR accuracy. It is common practice to train the DA with a mean-square error (MSE) loss, which corresponds to a regression model whose residual noise is modeled by a zero-mean, constant-covariance Gaussian distribution. However, neither of these assumptions is optimal, especially in highly non-stationary noisy and far-field scenarios. Here, we propose a more general loss based on a non-zero-mean, heteroscedastic-covariance distribution for the residual variables. On top of this, we present several novel DA architectures that are better suited to the heteroscedastic loss. It is shown that the proposed methods outperform the conventional DA with MSE loss by a large margin. We observe a relative improvement of 7.31% in WER compared to the conventional DA and, overall, a relative improvement of 14.4% compared to the mismatched train and test scenario.
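The idea of replacing MSE with a heteroscedastic Gaussian loss can be illustrated with a minimal sketch. The function below is a generic per-element Gaussian negative log-likelihood, not the paper's exact formulation (the paper additionally models a non-zero residual mean); the variable names are illustrative assumptions.

```python
import numpy as np

def heteroscedastic_nll(target, pred_mean, pred_log_var):
    """Gaussian negative log-likelihood with per-element (predicted) variance.

    If pred_log_var is held fixed at zero, this reduces to 0.5 * MSE,
    recovering the constant-variance assumption behind the usual MSE loss.
    Letting the network predict pred_log_var per frame/dimension gives the
    heteroscedastic generalization (constant terms dropped).
    """
    var = np.exp(pred_log_var)
    return 0.5 * np.mean(pred_log_var + (target - pred_mean) ** 2 / var)

# With unit variance (log-variance = 0) the loss equals 0.5 * MSE.
y = np.array([1.0, 2.0, 3.0])        # hypothetical clean-feature targets
mu = np.array([1.5, 2.0, 2.5])       # hypothetical DA outputs
mse_like = heteroscedastic_nll(y, mu, np.zeros_like(y))
```

Predicting a larger variance where the enhancement residual is inherently noisier down-weights those elements' squared errors, at the cost of the log-variance penalty term.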


 DOI: 10.21437/Interspeech.2019-2032

Cite as: Kumar, S., Rath, S.P. (2019) Far-Field Speech Enhancement Using Heteroscedastic Autoencoder for Improved Speech Recognition. Proc. Interspeech 2019, 446-450, DOI: 10.21437/Interspeech.2019-2032.


@inproceedings{Kumar2019,
  author={Shashi Kumar and Shakti P. Rath},
  title={{Far-Field Speech Enhancement Using Heteroscedastic Autoencoder for Improved Speech Recognition}},
  year=2019,
  booktitle={Proc. Interspeech 2019},
  pages={446--450},
  doi={10.21437/Interspeech.2019-2032},
  url={http://dx.doi.org/10.21437/Interspeech.2019-2032}
}