Extract, Adapt and Recognize: An End-to-End Neural Network for Corrupted Monaural Speech Recognition

Max W.Y. Lam, Jun Wang, Xunying Liu, Helen Meng, Dan Su, Dong Yu


Automatic speech recognition (ASR) in challenging conditions, such as in the presence of interfering speakers or music, remains an unsolved problem. This paper presents Extract, Adapt, and Recognize (EAR), an end-to-end neural network that jointly optimizes fully learnable separation and recognition components for the ASR criterion. Between a state-of-the-art speech separation module serving as an extractor and an acoustic modeling module serving as a recognizer, EAR introduces an adaptor, in which adapted acoustic features are learned from the separation outputs by a bi-directional long short-term memory (BLSTM) network trained to minimize the recognition loss directly. Relative to a conventional joint training model, the EAR model achieves 8.5% to 22.3% and 1.2% to 26.9% relative word error rate reductions (WERR) under various levels of music corruption and speaker interference, respectively. With speaker tracing, the WERR further improves to 12.4% to 29.0%.
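The three-stage pipeline described above (extractor → adaptor → recognizer, trained end-to-end on the recognition loss) can be sketched in a greatly simplified form. This is not the paper's implementation: the module bodies, dimensions, and the use of plain NumPy with random weights are all illustrative stand-ins; the actual system uses a learned separation network, a BLSTM adaptor, and a full acoustic model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (hypothetical; the paper's actual sizes differ).
T, F, H, S = 20, 40, 32, 10  # frames, feature dim, hidden dim, senone classes

def extractor(mixture):
    """Stand-in for the speech-separation front end (the "Extract" stage):
    here just a linear transform with random weights for illustration."""
    W = rng.standard_normal((F, F)) * 0.1
    return mixture @ W

def bilstm_adaptor(feats):
    """Greatly simplified bi-directional recurrence standing in for the
    BLSTM adaptor (the "Adapt" stage): a forward and a backward leaky
    accumulation over frames, concatenated per frame."""
    Wf = rng.standard_normal((F, H)) * 0.1
    Wb = rng.standard_normal((F, H)) * 0.1
    fwd, bwd = np.zeros((T, H)), np.zeros((T, H))
    hf, hb = np.zeros(H), np.zeros(H)
    for t in range(T):                      # forward pass over time
        hf = np.tanh(feats[t] @ Wf + 0.5 * hf)
        fwd[t] = hf
    for t in reversed(range(T)):            # backward pass over time
        hb = np.tanh(feats[t] @ Wb + 0.5 * hb)
        bwd[t] = hb
    return np.concatenate([fwd, bwd], axis=1)  # shape (T, 2H)

def recognizer(adapted):
    """Acoustic-model head (the "Recognize" stage): per-frame senone
    posteriors via a linear layer and softmax."""
    W = rng.standard_normal((2 * H, S)) * 0.1
    logits = adapted @ W
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

mixture = rng.standard_normal((T, F))       # corrupted monaural input features
post = recognizer(bilstm_adaptor(extractor(mixture)))

# In the paper, a single recognition loss (here: per-frame cross-entropy
# against illustrative random targets) is back-propagated through all
# three modules, which is what makes the separation features "adapted".
targets = rng.integers(0, S, size=T)
ce = -np.log(post[np.arange(T), targets]).mean()
```

The key design point the sketch mirrors is that no separation-level loss is needed on the extractor output; the adaptor bridges the mismatch between separated features and what the recognizer expects, with gradients flowing end-to-end from the ASR criterion.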


DOI: 10.21437/Interspeech.2019-1626

Cite as: Lam, M.W.Y., Wang, J., Liu, X., Meng, H., Su, D., Yu, D. (2019) Extract, Adapt and Recognize: An End-to-End Neural Network for Corrupted Monaural Speech Recognition. Proc. Interspeech 2019, 2778-2782, DOI: 10.21437/Interspeech.2019-1626.


@inproceedings{Lam2019,
  author={Max W.Y. Lam and Jun Wang and Xunying Liu and Helen Meng and Dan Su and Dong Yu},
  title={{Extract, Adapt and Recognize: An End-to-End Neural Network for Corrupted Monaural Speech Recognition}},
  year=2019,
  booktitle={Proc. Interspeech 2019},
  pages={2778--2782},
  doi={10.21437/Interspeech.2019-1626},
  url={http://dx.doi.org/10.21437/Interspeech.2019-1626}
}