A Multistage Training Framework for Acoustic-to-Word Model

Chengzhu Yu, Chunlei Zhang, Chao Weng, Jia Cui, Dong Yu


Acoustic-to-word (A2W) prediction models based on the Connectionist Temporal Classification (CTC) criterion have gained increasing interest in recent studies. Although previous studies have shown that an A2W system can achieve a competitive Word Error Rate (WER), a performance gap remains compared with conventional speech recognition systems when the amount of training data is not exceptionally large. In this study, we empirically investigate advanced model initializations and training strategies to achieve competitive speech recognition performance on the 300-hour subset of the Switchboard task (SWB-300Hr). We first investigate the use of hierarchical CTC pretraining for improved model initialization. We also explore a curriculum training strategy that gradually increases the target vocabulary size from 10k to 20k. Finally, joint CTC and Cross Entropy (CE) training techniques are studied to further improve the performance of the A2W system. The combination of hierarchical-CTC model initialization, curriculum training, and joint CTC-CE training translates to a relative WER reduction of 12.1%. Our final A2W system, evaluated on the Hub5-2000 test sets, achieves WERs of 11.4/20.8 on the Switchboard and CallHome parts without using a language model or decoder.
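The joint CTC-CE objective mentioned above interpolates the sequence-level CTC loss with a framewise cross-entropy loss computed against a forced alignment. As a rough illustration (not the authors' implementation; the interpolation weight `lam` and the plain-numpy CTC forward pass are assumptions for clarity), the combined loss might be sketched as:

```python
import numpy as np

def ctc_loss(probs, labels, blank=0):
    """Negative log-likelihood of a label sequence under the CTC
    forward (alpha) recursion. probs: (T, V) per-frame posteriors;
    labels: list of non-blank label ids."""
    T, _ = probs.shape
    # Extend the label sequence with blanks: b l1 b l2 b ... lN b
    ext = [blank]
    for l in labels:
        ext += [l, blank]
    S = len(ext)
    alpha = np.zeros((T, S))
    alpha[0, 0] = probs[0, ext[0]]
    if S > 1:
        alpha[0, 1] = probs[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1, s]
            if s > 0:
                a += alpha[t - 1, s - 1]
            # Skip transitions are allowed only between distinct non-blank labels
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                a += alpha[t - 1, s - 2]
            alpha[t, s] = a * probs[t, ext[s]]
    tail = alpha[T - 1, S - 1] + (alpha[T - 1, S - 2] if S > 1 else 0.0)
    return -np.log(tail)

def frame_ce_loss(probs, frame_labels):
    """Framewise cross entropy against a per-frame (forced) alignment."""
    return -np.mean([np.log(probs[t, l]) for t, l in enumerate(frame_labels)])

def joint_ctc_ce_loss(probs, labels, frame_labels, lam=0.5, blank=0):
    """Interpolated multitask objective: lam * CTC + (1 - lam) * CE."""
    return (lam * ctc_loss(probs, labels, blank)
            + (1 - lam) * frame_ce_loss(probs, frame_labels))
```

In practice both terms would be computed from the same network output and backpropagated jointly; the sketch operates on raw posteriors rather than log-probabilities purely for readability.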


DOI: 10.21437/Interspeech.2018-1452

Cite as: Yu, C., Zhang, C., Weng, C., Cui, J., Yu, D. (2018) A Multistage Training Framework for Acoustic-to-Word Model. Proc. Interspeech 2018, 786-790, DOI: 10.21437/Interspeech.2018-1452.


@inproceedings{Yu2018,
  author={Chengzhu Yu and Chunlei Zhang and Chao Weng and Jia Cui and Dong Yu},
  title={A Multistage Training Framework for Acoustic-to-Word Model},
  year=2018,
  booktitle={Proc. Interspeech 2018},
  pages={786--790},
  doi={10.21437/Interspeech.2018-1452},
  url={http://dx.doi.org/10.21437/Interspeech.2018-1452}
}