Multi-Task CTC Training with Auxiliary Feature Reconstruction for End-to-End Speech Recognition

Gakuto Kurata, Kartik Audhkhasi


We present a multi-task Connectionist Temporal Classification (CTC) training method for end-to-end (E2E) automatic speech recognition that uses input feature reconstruction as an auxiliary task. The main E2E CTC task and the auxiliary reconstruction task share the encoder network, and the auxiliary task tries to reconstruct the input features from the encoded information. In addition to standard feature reconstruction, we distort the input features only in the auxiliary reconstruction task, for example by (1) swapping the former and latter parts of an utterance, or (2) using only part of an utterance by stripping frames from its beginning or end. These distortions intentionally suppress long-span dependencies in the time domain, which avoids overfitting to the training data. We trained phone-based and word-based CTC models with the proposed multi-task learning and demonstrated that it improves ASR accuracy on various test sets, both matched and unmatched with the training data.
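The two distortions named in the abstract can be sketched as simple frame-level operations on an utterance's feature matrix. The following is a minimal illustrative sketch, not the authors' implementation: the `(T, D)` frame-matrix layout, the function names, and the `fraction` hyperparameter are all assumptions for illustration.

```python
import numpy as np

def swap_halves(feats):
    """Distortion (1): swap the former and latter parts of an utterance.

    feats: (T, D) array of T acoustic feature frames.
    The split point at T // 2 is an assumed choice.
    """
    mid = feats.shape[0] // 2
    return np.concatenate([feats[mid:], feats[:mid]], axis=0)

def strip_edge(feats, fraction=0.25, strip_end=False):
    """Distortion (2): keep only part of the utterance by stripping
    frames from its beginning (or end).

    `fraction` is a hypothetical hyperparameter controlling how many
    frames are removed; at least one frame is always stripped.
    """
    n = max(1, int(feats.shape[0] * fraction))
    return feats[:-n] if strip_end else feats[n:]
```

In a multi-task setup along the lines the abstract describes, the CTC branch would consume the clean features while the reconstruction branch is trained against such distorted variants, so the encoder cannot rely on long-span temporal structure to solve the auxiliary task.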


DOI: 10.21437/Interspeech.2019-1710

Cite as: Kurata, G., Audhkhasi, K. (2019) Multi-Task CTC Training with Auxiliary Feature Reconstruction for End-to-End Speech Recognition. Proc. Interspeech 2019, 1636-1640, DOI: 10.21437/Interspeech.2019-1710.


@inproceedings{Kurata2019,
  author={Gakuto Kurata and Kartik Audhkhasi},
  title={{Multi-Task CTC Training with Auxiliary Feature Reconstruction for End-to-End Speech Recognition}},
  year=2019,
  booktitle={Proc. Interspeech 2019},
  pages={1636--1640},
  doi={10.21437/Interspeech.2019-1710},
  url={http://dx.doi.org/10.21437/Interspeech.2019-1710}
}