Framewise Supervised Training Towards End-to-End Speech Recognition Models: First Results

Mohan Li, Yuanjiang Cao, Weicong Zhou, Min Liu


Recurrent neural networks (RNNs) trained with the connectionist temporal classification (CTC) technique have delivered promising results in many speech recognition tasks. However, the forward-backward algorithm that CTC uses for model optimization requires a huge amount of computation. This paper introduces a new training method for RNN-based end-to-end models, which significantly saves computing power without losing accuracy. Unlike CTC, the label sequence is aligned to the labelling hypothesis, and then to the input sequence, by the Weighted Minimum Edit-Distance Aligning (WMEDA) algorithm. Based on this alignment, framewise supervised training is conducted. Moreover, Pronunciation Embedding (PE), an acoustic representation of a linguistic target, is proposed in order to calculate the weights in the WMEDA algorithm. The model is evaluated on the TIMIT and AIShell-1 datasets for English phoneme and Chinese character recognition. For TIMIT, the model achieves an 18.57% PER, comparable to the 18.4% PER of the CTC baseline. As for AIShell-1, a joint Pinyin-character model is trained, giving a 19.38% CER, which is slightly better than the 19.43% CER obtained by the CTC character model, while the training time of this model is only 54.3% of the CTC model's.
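To make the alignment step concrete: the abstract's WMEDA builds on the classic weighted minimum edit-distance dynamic program. The sketch below is only a generic weighted edit-distance aligner, not the paper's exact WMEDA; in particular, the paper derives substitution weights from Pronunciation Embeddings, whereas here `sub_cost` defaults to a hypothetical 0/1 cost, and the function name `weighted_align` is our own.

```python
# Minimal sketch of weighted minimum edit-distance alignment (illustrative
# only; the paper's actual WMEDA uses Pronunciation-Embedding-based weights,
# which are not reproduced here).

def weighted_align(ref, hyp,
                   sub_cost=lambda a, b: 0.0 if a == b else 1.0,
                   ins_cost=1.0, del_cost=1.0):
    """Align ref to hyp by dynamic programming.

    Returns (total_cost, pairs), where pairs is a list of
    (ref_symbol_or_None, hyp_symbol_or_None) tuples and None marks a gap
    (deletion from ref or insertion in hyp).
    """
    n, m = len(ref), len(hyp)
    # dp[i][j] = minimum cost of aligning ref[:i] with hyp[:j]
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = dp[i - 1][0] + del_cost
    for j in range(1, m + 1):
        dp[0][j] = dp[0][j - 1] + ins_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = min(
                dp[i - 1][j - 1] + sub_cost(ref[i - 1], hyp[j - 1]),
                dp[i - 1][j] + del_cost,
                dp[i][j - 1] + ins_cost,
            )
    # Backtrace from (n, m) to recover one optimal alignment.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0
                and dp[i][j] == dp[i - 1][j - 1]
                + sub_cost(ref[i - 1], hyp[j - 1])):
            pairs.append((ref[i - 1], hyp[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + del_cost:
            pairs.append((ref[i - 1], None))
            i -= 1
        else:
            pairs.append((None, hyp[j - 1]))
            j -= 1
    return dp[n][m], pairs[::-1]
```

With a learned, acoustically informed `sub_cost` (as PE provides in the paper), confusable labels would receive lower substitution weights, biasing the backtrace towards acoustically plausible alignments.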


 DOI: 10.21437/Interspeech.2019-1117

Cite as: Li, M., Cao, Y., Zhou, W., Liu, M. (2019) Framewise Supervised Training Towards End-to-End Speech Recognition Models: First Results. Proc. Interspeech 2019, 1641-1645, DOI: 10.21437/Interspeech.2019-1117.


@inproceedings{Li2019,
  author={Mohan Li and Yuanjiang Cao and Weicong Zhou and Min Liu},
  title={{Framewise Supervised Training Towards End-to-End Speech Recognition Models: First Results}},
  year=2019,
  booktitle={Proc. Interspeech 2019},
  pages={1641--1645},
  doi={10.21437/Interspeech.2019-1117},
  url={http://dx.doi.org/10.21437/Interspeech.2019-1117}
}