Situation informed end-to-end ASR for noisy environments

Siddharth Dalmia, Suyoun Kim, Florian Metze


This paper describes an end-to-end speech recognition system for the 5th CHiME challenge, which addresses continuous conversational speech in everyday environments recorded with distributed microphone arrays. The main contribution of our system is an investigation of effective adaptation within the end-to-end system, based on speaker gender information, microphone array information, and conversational history information, for better generalization. Without using any speech enhancement, data augmentation, data cleanup, or lexicon information, our proposed system achieves better ASR performance than the baseline system (LF-MMI TDNN), which requires lexicon information and a complicated conventional modeling process (i.e., HMM/GMM, triphone-based acoustic modeling, fMLLR, SAT, i-vectors, data cleanup, etc.). Our final ASR system achieves an absolute word error rate reduction of 12.6% on the development set compared with the end-to-end baseline system, and an absolute word error rate reduction of 1.5% on the evaluation set compared with the conventional baseline system (LF-MMI TDNN) in the single-array track.
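
The results above are stated as absolute word error rate (WER) reductions, that is, differences in percentage points rather than percentages of the baseline WER. As a minimal sketch of that distinction, the Python below computes WER as word-level edit distance divided by reference length, then compares absolute and relative reduction. The function names are hypothetical, and the numbers in the usage note are illustrative placeholders, not results from the paper or the challenge scoring tools.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent: word-level edit distance / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return 100.0 * prev[-1] / len(ref)


def absolute_reduction(baseline_wer: float, system_wer: float) -> float:
    """Difference in percentage points between two WER values."""
    return baseline_wer - system_wer


def relative_reduction(baseline_wer: float, system_wer: float) -> float:
    """Reduction expressed as a percentage of the baseline WER."""
    return 100.0 * (baseline_wer - system_wer) / baseline_wer
```

For example, with made-up WER values of 80.0 for a baseline and 70.0 for a new system, `absolute_reduction(80.0, 70.0)` gives 10.0 points while `relative_reduction(80.0, 70.0)` gives 12.5 percent, so the two figures should not be compared directly.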


DOI: 10.21437/CHiME.2018-11

Cite as: Dalmia, S., Kim, S., Metze, F. (2018) Situation informed end-to-end ASR for noisy environments. Proc. CHiME 2018 Workshop on Speech Processing in Everyday Environments, 49-52, DOI: 10.21437/CHiME.2018-11.


@inproceedings{Dalmia2018,
  author={Siddharth Dalmia and Suyoun Kim and Florian Metze},
  title={{Situation informed end-to-end ASR for noisy environments}},
  year=2018,
  booktitle={Proc. CHiME 2018 Workshop on Speech Processing in Everyday Environments},
  pages={49--52},
  doi={10.21437/CHiME.2018-11},
  url={http://dx.doi.org/10.21437/CHiME.2018-11}
}