End-to-End Speech Recognition with Auditory Attention for Multi-Microphone Distance Speech Recognition

Suyoun Kim, Ian Lane


End-to-End speech recognition is a recently proposed approach that directly transcribes input speech to text using a single model. End-to-End methods, including Connectionist Temporal Classification and Attention-based Encoder-Decoder Networks, have been shown to obtain state-of-the-art performance on a number of tasks and to significantly simplify the modeling, training, and decoding procedures for speech recognition. In this paper, we extend our prior work on End-to-End speech recognition, focusing on the effectiveness of these models in far-field environments. Specifically, we propose introducing Auditory Attention to integrate input from multiple microphones directly within an End-to-End speech recognition model, leveraging the attention mechanism to dynamically tune the model's attention to the most reliable input sources. We evaluate the proposed model on the CHiME-4 task and show substantial improvement over a model optimized for a single microphone input.
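The core idea, attention weights that softly select among microphone channels, can be illustrated with a minimal NumPy sketch. This is not the paper's exact formulation: the scoring function, parameter names, and dimensions here are illustrative assumptions, standing in for the learned attention layer inside the encoder.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def channel_attention(features, w, b=0.0):
    """Fuse multi-microphone features with a soft attention weighting.

    features: (C, D) array, one D-dim feature vector per microphone channel.
    w: (D,) scoring vector; b: scalar bias. In the real model these would be
    learned jointly with the recognizer; here they are fixed for illustration.
    Returns the attention-weighted combination (D,) and the weights (C,).
    """
    scores = features @ w + b   # one relevance score per channel
    alpha = softmax(scores)     # normalize scores into an attention distribution
    fused = alpha @ features    # convex combination of the channel features
    return fused, alpha
```

A channel whose features score higher (e.g. a microphone with less reverberation or noise) receives a larger weight, so the fused representation leans toward the most reliable inputs rather than averaging all channels uniformly.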


DOI: 10.21437/Interspeech.2017-1536

Cite as: Kim, S., Lane, I. (2017) End-to-End Speech Recognition with Auditory Attention for Multi-Microphone Distance Speech Recognition. Proc. Interspeech 2017, 3867-3871, DOI: 10.21437/Interspeech.2017-1536.


@inproceedings{Kim2017,
  author={Suyoun Kim and Ian Lane},
  title={End-to-End Speech Recognition with Auditory Attention for Multi-Microphone Distance Speech Recognition},
  year=2017,
  booktitle={Proc. Interspeech 2017},
  pages={3867--3871},
  doi={10.21437/Interspeech.2017-1536},
  url={http://dx.doi.org/10.21437/Interspeech.2017-1536}
}