Investigation of Transformer Based Spelling Correction Model for CTC-Based End-to-End Mandarin Speech Recognition

Shiliang Zhang, Ming Lei, Zhijie Yan


Connectionist Temporal Classification (CTC) based end-to-end speech recognition systems usually need to incorporate an external language model via WFST-based decoding in order to achieve promising results. This is especially important for Mandarin speech recognition, since Mandarin exhibits a special phenomenon, homophony, which causes many substitution errors. The linguistic information introduced by the language model helps to distinguish these substitution errors. In this work, we propose a transformer-based spelling correction model to automatically correct errors, especially substitution errors, made by a CTC-based Mandarin speech recognition system. Specifically, we use the recognition results generated by the CTC-based system as input and the ground-truth transcriptions as output to train a transformer with an encoder-decoder architecture, much like machine translation. Experimental results on a 20,000-hour Mandarin speech recognition task show that the proposed spelling correction model achieves a CER of 3.41%, a 22.9% and 53.2% relative improvement over the baseline CTC-based system decoded with and without a language model, respectively.
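The CER figures above are computed from a Levenshtein alignment between hypothesis and reference, in which homophone mistakes surface as substitutions, the error type the correction model targets. A minimal illustrative sketch of that metric (our own helper names, not code from the paper):

```python
# Illustrative sketch (not from the paper): character error rate (CER) and an
# error-type breakdown via Levenshtein alignment. Mandarin homophone mistakes
# (e.g. recognizing 汽 for 气) appear as substitution errors.

def align_errors(ref: str, hyp: str):
    """Return (substitutions, insertions, deletions) between ref and hyp."""
    n, m = len(ref), len(hyp)
    # dp[i][j] = (edit distance, subs, ins, dels) for ref[:i] vs hyp[:j]
    dp = [[None] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = (0, 0, 0, 0)
    for i in range(1, n + 1):
        dp[i][0] = (i, 0, 0, i)          # ref chars left over: deletions
    for j in range(1, m + 1):
        dp[0][j] = (j, 0, j, 0)          # hyp chars left over: insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d = dp[i - 1][j - 1]
            sub = (d[0] + cost, d[1] + cost, d[2], d[3])
            d = dp[i][j - 1]
            ins = (d[0] + 1, d[1], d[2] + 1, d[3])
            d = dp[i - 1][j]
            dele = (d[0] + 1, d[1], d[2], d[3] + 1)
            dp[i][j] = min(sub, ins, dele)  # lexicographic: distance first
    return dp[n][m][1:]

def cer(ref: str, hyp: str) -> float:
    """CER = (S + I + D) / reference length."""
    s, i, d = align_errors(ref, hyp)
    return (s + i + d) / max(len(ref), 1)
```

For example, with reference 今天天气很好 and hypothesis 今天天汽很好, the alignment yields one substitution and a CER of 1/6; a spelling correction model trained on such (hypothesis, reference) pairs learns to undo exactly these substitutions.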


DOI: 10.21437/Interspeech.2019-1290

Cite as: Zhang, S., Lei, M., Yan, Z. (2019) Investigation of Transformer Based Spelling Correction Model for CTC-Based End-to-End Mandarin Speech Recognition. Proc. Interspeech 2019, 2180-2184, DOI: 10.21437/Interspeech.2019-1290.


@inproceedings{Zhang2019,
  author={Shiliang Zhang and Ming Lei and Zhijie Yan},
  title={{Investigation of Transformer Based Spelling Correction Model for CTC-Based End-to-End Mandarin Speech Recognition}},
  year=2019,
  booktitle={Proc. Interspeech 2019},
  pages={2180--2184},
  doi={10.21437/Interspeech.2019-1290},
  url={http://dx.doi.org/10.21437/Interspeech.2019-1290}
}