Learn Spelling from Teachers: Transferring Knowledge from Language Models to Sequence-to-Sequence Speech Recognition

Ye Bai, Jiangyan Yi, Jianhua Tao, Zhengkun Tian, Zhengqi Wen


Integrating an external language model into a sequence-to-sequence speech recognition system is non-trivial. Previous works use linear interpolation or a fusion network to integrate external language models, but these approaches introduce additional components and increase decoding computation. In this paper, we instead propose a knowledge-distillation-based training approach for integrating external language models into a sequence-to-sequence model. A recurrent neural network language model, trained on large-scale external text, generates soft labels to guide the training of the sequence-to-sequence model; the language model thus plays the role of the teacher. The approach adds no external component to the sequence-to-sequence model at test time, and it can still be combined with the shallow fusion technique during decoding. The experiments are conducted on the public Chinese datasets AISHELL-1 and CLMAD. Our approach achieves a character error rate of 9.3%, an 18.42% relative reduction compared with the vanilla sequence-to-sequence model.
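The abstract describes a standard distillation objective: the pre-trained RNN language model produces a soft distribution over the next character, and the sequence-to-sequence decoder is trained against an interpolation of that distribution and the hard transcript labels. Below is a minimal PyTorch-style sketch of such a loss; the function name `distillation_loss`, the interpolation weight `lambda_`, and the temperature are illustrative assumptions, not the paper's exact formulation or hyper-parameters.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets,
                      lambda_=0.5, temperature=1.0, pad_id=0):
    """Sketch of LM-to-seq2seq knowledge distillation (illustrative only).

    student_logits: (B, T, V) logits from the seq2seq decoder.
    teacher_logits: (B, T, V) logits from the pre-trained RNN LM,
                    run over the same target history (no gradients).
    targets:        (B, T) ground-truth token ids (hard labels).
    lambda_, temperature: assumed hyper-parameters for illustration.
    """
    mask = (targets != pad_id).float()                      # ignore padding

    # Hard-label cross entropy: the usual seq2seq training term.
    ce = F.cross_entropy(student_logits.transpose(1, 2), targets,
                         reduction="none")                  # (B, T)
    ce = (ce * mask).sum() / mask.sum()

    # Soft-label term: cross entropy against the teacher LM's
    # (temperature-smoothed) next-token distribution.
    with torch.no_grad():
        soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    kd = -(soft_targets * log_probs).sum(dim=-1)            # (B, T)
    kd = (kd * mask).sum() / mask.sum()

    # Interpolate the two terms; the language model acts as the teacher.
    return (1.0 - lambda_) * ce + lambda_ * kd
```

Because the teacher only shapes the training loss, decoding is unchanged: no extra network is evaluated at test time, which is also why the trained model can still be combined with shallow fusion (adding a weighted LM log-probability to the beam-search score) if desired.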


 DOI: 10.21437/Interspeech.2019-1554

Cite as: Bai, Y., Yi, J., Tao, J., Tian, Z., Wen, Z. (2019) Learn Spelling from Teachers: Transferring Knowledge from Language Models to Sequence-to-Sequence Speech Recognition. Proc. Interspeech 2019, 3795-3799, DOI: 10.21437/Interspeech.2019-1554.


@inproceedings{Bai2019,
  author={Ye Bai and Jiangyan Yi and Jianhua Tao and Zhengkun Tian and Zhengqi Wen},
  title={{Learn Spelling from Teachers: Transferring Knowledge from Language Models to Sequence-to-Sequence Speech Recognition}},
  year=2019,
  booktitle={Proc. Interspeech 2019},
  pages={3795--3799},
  doi={10.21437/Interspeech.2019-1554},
  url={http://dx.doi.org/10.21437/Interspeech.2019-1554}
}