Stream Attention for Distributed Multi-Microphone Speech Recognition

Xiaofei Wang, Ruizhi Li, Hynek Hermansky


Exploiting multiple microphones is a widely used strategy for robust automatic speech recognition (ASR). In a typical hands-free scenario, speech is acquired simultaneously by a set of distributed microphones or microphone arrays. Each microphone or array (defined as a stream) carries information of a different quality, and fusing the streams helps sustain distant recognition performance against disturbances such as noise, reverberation, and speaker movement. In this work, we propose a stream attention framework to improve far-field ASR performance in the distributed multi-microphone configuration. Frame-level attention vectors are derived by predicting the ASR performance of each stream's acoustic model from the posterior probabilities of its classifier; they characterize how much useful information each stream contributes, enabling an efficient and better-performing decoding scheme. We evaluate the proposed stream attention system on two real recorded datasets, Mixer-6 and DIRHA-WSJ. The experimental results show that the proposed framework yields substantial reductions in word error rate (WER) compared to conventional strategies.
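To make the idea concrete, the following minimal sketch shows one common way to turn per-stream classifier posteriors into frame-level attention weights. Note the assumptions: the paper does not specify this exact measure here, so the sketch uses inverse posterior entropy as a stand-in confidence/performance predictor, and the function names (`stream_attention_weights`, `fuse_posteriors`) and the posterior-averaging fusion rule are illustrative choices, not the authors' method.

```python
import numpy as np

def stream_attention_weights(posteriors, eps=1e-10):
    """Frame-level stream weights from inverse posterior entropy.

    posteriors: array of shape (num_streams, num_frames, num_classes),
    where each (stream, frame) row is a posterior distribution over
    acoustic classes. Assumption: a lower-entropy (more confident)
    posterior indicates a more reliable stream for that frame.
    """
    # Per-frame entropy of each stream's posterior distribution: (S, T)
    ent = -np.sum(posteriors * np.log(posteriors + eps), axis=-1)
    inv = 1.0 / (ent + eps)
    # Normalize across streams so the weights sum to 1 at every frame.
    return inv / inv.sum(axis=0, keepdims=True)

def fuse_posteriors(posteriors):
    """Attention-weighted combination of stream posteriors: (T, C)."""
    w = stream_attention_weights(posteriors)          # (S, T)
    return np.einsum('st,stc->tc', w, posteriors)     # weighted sum
```

For example, a frame where one stream produces a sharp posterior (e.g. [0.9, 0.05, 0.05]) and another a near-uniform one (e.g. [0.34, 0.33, 0.33]) is dominated by the confident stream in the fused output, which is the behavior the abstract describes: streams carrying more useful information contribute more to decoding.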


 DOI: 10.21437/Interspeech.2018-1037

Cite as: Wang, X., Li, R., Hermansky, H. (2018) Stream Attention for Distributed Multi-Microphone Speech Recognition. Proc. Interspeech 2018, 3033-3037, DOI: 10.21437/Interspeech.2018-1037.


@inproceedings{Wang2018,
  author={Xiaofei Wang and Ruizhi Li and Hynek Hermansky},
  title={Stream Attention for Distributed Multi-Microphone Speech Recognition},
  year=2018,
  booktitle={Proc. Interspeech 2018},
  pages={3033--3037},
  doi={10.21437/Interspeech.2018-1037},
  url={http://dx.doi.org/10.21437/Interspeech.2018-1037}
}