Acoustic Modeling with DFSMN-CTC and Joint CTC-CE Learning

ShiLiang Zhang, Ming Lei


Recently, connectionist temporal classification (CTC) based acoustic models have achieved comparable or even better performance, with much higher decoding efficiency, than conventional hybrid systems in LVCSR tasks. CTC-based models typically use LSTM-type networks as acoustic models. However, LSTMs are computationally expensive and sometimes difficult to train with the CTC criterion. In this paper, inspired by recent DFSMN work, we propose to replace the LSTMs with DFSMN in CTC-based acoustic modeling and explore how this type of non-recurrent model behaves when trained with the CTC loss. We have evaluated the performance of DFSMN-CTC using both context-independent (CI) and context-dependent (CD) phones as target labels in many LVCSR tasks with various amounts of training data. Experimental results show that DFSMN-CTC acoustic models using either CI-Phones or CD-Phones can significantly outperform conventional hybrid models trained with CD-Phones and the cross-entropy (CE) criterion. Moreover, a novel joint CTC and CE training method is proposed, which improves both the stability of CTC training and the final performance. In a 20,000-hour Mandarin recognition task, jointly CTC-CE trained DFSMN achieves an 11.0% and a 30.1% relative performance improvement over DFSMN-CE models on a normal-speed and a fast-speed test set, respectively.
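To make the joint objective concrete, the sketch below combines a CTC loss (computed with the standard forward algorithm over a blank-extended label sequence) and a frame-level cross-entropy loss against a fixed alignment, as a weighted sum L = (1 - λ)·L_CTC + λ·L_CE. This is a minimal toy illustration, not the paper's implementation: the function names, the interpolation form, and the weight `lam` are assumptions, and a real system would work in log-space over batched network outputs.

```python
import math

BLANK = 0  # index of the CTC blank symbol (convention assumed here)

def ctc_loss(probs, labels):
    """Negative log CTC probability of `labels` given per-frame
    distributions `probs` (T lists over the output vocabulary),
    via the standard forward (alpha) recursion."""
    # Extend the label sequence with blanks: l' = [b, l1, b, l2, b, ...]
    ext = [BLANK]
    for l in labels:
        ext += [l, BLANK]
    S, T = len(ext), len(probs)
    alpha = [[0.0] * S for _ in range(T)]
    alpha[0][0] = probs[0][ext[0]]
    if S > 1:
        alpha[0][1] = probs[0][ext[1]]
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1][s]                       # stay
            if s > 0:
                a += alpha[t - 1][s - 1]              # advance one
            if s > 1 and ext[s] != BLANK and ext[s] != ext[s - 2]:
                a += alpha[t - 1][s - 2]              # skip a blank
            alpha[t][s] = a * probs[t][ext[s]]
    total = alpha[T - 1][S - 1] + (alpha[T - 1][S - 2] if S > 1 else 0.0)
    return -math.log(total)

def ce_loss(probs, frame_targets):
    """Frame-level cross entropy against a fixed (e.g. forced) alignment."""
    return -sum(math.log(p[y]) for p, y in zip(probs, frame_targets)) / len(probs)

def joint_ctc_ce_loss(probs, labels, frame_targets, lam=0.5):
    """Interpolated joint objective; `lam` trades off CE against CTC.
    The exact combination scheme in the paper may differ."""
    return (1.0 - lam) * ctc_loss(probs, labels) + lam * ce_loss(probs, frame_targets)
```

For example, with three frames over a three-symbol vocabulary, `joint_ctc_ce_loss(probs, [1], [1, 1, 0], lam=0.0)` reduces to the pure CTC loss and `lam=1.0` reduces to the pure CE loss; intermediate values interpolate between the two, which is what lets the CE term regularize early CTC training.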


 DOI: 10.21437/Interspeech.2018-1049

Cite as: Zhang, S., Lei, M. (2018) Acoustic Modeling with DFSMN-CTC and Joint CTC-CE Learning. Proc. Interspeech 2018, 771-775, DOI: 10.21437/Interspeech.2018-1049.


@inproceedings{Zhang2018,
  author={ShiLiang Zhang and Ming Lei},
  title={Acoustic Modeling with DFSMN-CTC and Joint CTC-CE Learning},
  year=2018,
  booktitle={Proc. Interspeech 2018},
  pages={771--775},
  doi={10.21437/Interspeech.2018-1049},
  url={http://dx.doi.org/10.21437/Interspeech.2018-1049}
}