Auxiliary Feature Based Adaptation of End-to-end ASR Systems

Marc Delcroix, Shinji Watanabe, Atsunori Ogawa, Shigeki Karita, Tomohiro Nakatani


Acoustic model adaptation has been widely used to adapt models to speakers or environments. For example, appending auxiliary features representing speakers, such as i-vectors, to the input of a deep neural network (DNN) is an effective way to realize unsupervised adaptation of DNN-hybrid automatic speech recognition (ASR) systems. Recently, end-to-end (E2E) models have been proposed as an alternative to conventional DNN-hybrid ASR systems. E2E models map a speech signal to a sequence of characters or words using a single neural network, which greatly simplifies the ASR pipeline. However, adaptation of E2E models has so far received little attention. In this paper, we investigate auxiliary feature based adaptation for encoder-decoder E2E models. We employ a recently proposed sequence summary network to compute auxiliary features instead of i-vectors, as it can be easily integrated into E2E models and keeps the ASR pipeline simple. Indeed, the sequence summary network allows the auxiliary feature extraction module to be part of the computational graph of the E2E model. We demonstrate that the proposed adaptation scheme consistently improves recognition performance on three publicly available recognition tasks.
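As a rough illustration of the idea described above, the following minimal sketch computes an utterance-level auxiliary vector by passing each frame through a small network and averaging over time, then appends that vector to every input frame. It is not the authors' implementation: the single tanh layer, the concatenation step, the feature sizes, the random weights, and the function names (affine, sequence_summary, adapt_inputs) are all assumptions made for illustration. In an actual E2E system the summary network weights would be trained jointly with the encoder-decoder rather than drawn at random.

```python
import math
import random


def affine(x, W, b):
    """Return W @ x + b for a vector x (plain lists of floats)."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]


def sequence_summary(frames, W, b):
    """Map each frame through a tanh layer, then average over time.

    The single-layer architecture is an assumption for illustration.
    """
    hidden = [[math.tanh(v) for v in affine(f, W, b)] for f in frames]
    n = len(hidden)
    return [sum(col) / n for col in zip(*hidden)]


def adapt_inputs(frames, W, b):
    """Append the utterance-level summary vector to every frame.

    Concatenation is assumed here; other ways of combining the
    summary with the input are possible.
    """
    s = sequence_summary(frames, W, b)
    return [f + s for f in frames]  # list concatenation, not addition


# Toy utterance: 5 frames of 3-dimensional features (hypothetical sizes).
rng = random.Random(0)
feat_dim, aux_dim, n_frames = 3, 2, 5
frames = [[rng.gauss(0, 1) for _ in range(feat_dim)] for _ in range(n_frames)]

# Randomly initialised summary-network weights (stand-ins for trained ones).
W = [[rng.gauss(0, 0.1) for _ in range(feat_dim)] for _ in range(aux_dim)]
b = [0.0] * aux_dim

adapted = adapt_inputs(frames, W, b)
print(len(adapted), len(adapted[0]))
```

Because the summary is a differentiable function of the input frames, it can sit inside the same computational graph as the recognizer, which is the property the abstract highlights in contrast to separately extracted i-vectors.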


DOI: 10.21437/Interspeech.2018-1438

Cite as: Delcroix, M., Watanabe, S., Ogawa, A., Karita, S., Nakatani, T. (2018) Auxiliary Feature Based Adaptation of End-to-end ASR Systems. Proc. Interspeech 2018, 2444-2448, DOI: 10.21437/Interspeech.2018-1438.


@inproceedings{Delcroix2018,
  author={Marc Delcroix and Shinji Watanabe and Atsunori Ogawa and Shigeki Karita and Tomohiro Nakatani},
  title={Auxiliary Feature Based Adaptation of End-to-end ASR Systems},
  year=2018,
  booktitle={Proc. Interspeech 2018},
  pages={2444--2448},
  doi={10.21437/Interspeech.2018-1438},
  url={http://dx.doi.org/10.21437/Interspeech.2018-1438}
}