Semi-Supervised DNN Training with Word Selection for ASR

Karel Veselý, Lukáš Burget, Jan Černocký


Not all questions related to the semi-supervised training of hybrid ASR systems with DNN acoustic models have yet been deeply investigated. In this paper, we focus on the granularity of confidences (per-sentence, per-word, per-frame) and on how the data should be used (data selection by masks, or mini-batch SGD with confidences as weights). We then propose to re-tune the system with the manually transcribed data, using both ‘frame CE’ training and ‘sMBR’ training.
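To illustrate the weighted-SGD alternative mentioned above, here is a minimal sketch of a per-frame confidence-weighted cross-entropy loss. The function name, array shapes, and data are our own illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def weighted_frame_ce(log_probs, targets, frame_weights):
    """Confidence-weighted frame cross-entropy (illustrative sketch).

    log_probs:     (T, C) array of log posteriors per frame
    targets:       (T,) array of target state/class indices
    frame_weights: (T,) array of per-frame confidences in [0, 1]
    """
    # Negative log-likelihood of the target class at each frame
    ce = -log_probs[np.arange(len(targets)), targets]
    # Confidence-weighted average: low-confidence frames contribute less
    return float(np.sum(frame_weights * ce) / np.sum(frame_weights))

# Toy example: two frames, two classes
log_probs = np.log(np.array([[0.7, 0.3],
                             [0.2, 0.8]]))
targets = np.array([0, 1])
weights = np.array([1.0, 0.5])
loss = weighted_frame_ce(log_probs, targets, weights)
```

In mini-batch SGD, such a weighted loss scales each frame's gradient by its confidence, rather than discarding frames outright as a binary mask would.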

Our preferred semi-supervised recipe, which is both simple and efficient, is the following: we select words according to the word accuracy we obtain on the development set. Such a recipe, which does not rely on a grid search over training hyper-parameters, generalized well to Babel Vietnamese (11h transcribed, 74h untranscribed), Babel Bengali (11h transcribed, 58h untranscribed), and our custom Switchboard setup (14h transcribed, 95h untranscribed). We obtained absolute WER improvements of 2.5% for Vietnamese, 2.3% for Bengali, and 3.2% for Switchboard.
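The recipe above can be sketched as follows: calibrate a confidence threshold on the development set so that surviving words reach a target word accuracy, then keep only hypothesis words passing that threshold. All names and data here are illustrative assumptions, not the authors' code:

```python
def calibrate_threshold(dev_words, target_accuracy):
    """Find the lowest confidence threshold at which the surviving
    dev-set words reach the target word accuracy.

    dev_words: list of (confidence, is_correct) pairs from the dev set
    """
    for thr in sorted({conf for conf, _ in dev_words}):
        kept = [ok for conf, ok in dev_words if conf >= thr]
        if kept and sum(kept) / len(kept) >= target_accuracy:
            return thr
    return None  # target accuracy not reachable on this dev set

def select_words(hyp_words, threshold):
    """Keep hypothesis words whose confidence passes the threshold;
    rejected words are masked out of the training targets."""
    return [(w, conf) for w, conf in hyp_words if conf >= threshold]

# Toy dev set: (confidence, word-is-correct) pairs
dev = [(0.9, True), (0.8, True), (0.6, False), (0.4, False), (0.95, True)]
thr = calibrate_threshold(dev, target_accuracy=0.9)

# Apply the calibrated threshold to automatic transcripts
hyp = [("xin", 0.97), ("chao", 0.5), ("ban", 0.85)]
kept = select_words(hyp, thr)
```

Because the threshold is fixed by a measurable dev-set statistic rather than tuned per setup, the same procedure transfers across languages without a hyper-parameter grid search.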


 DOI: 10.21437/Interspeech.2017-1385

Cite as: Veselý, K., Burget, L., Černocký, J. (2017) Semi-Supervised DNN Training with Word Selection for ASR. Proc. Interspeech 2017, 3687-3691, DOI: 10.21437/Interspeech.2017-1385.


@inproceedings{Veselý2017,
  author={Karel Veselý and Lukáš Burget and Jan Černocký},
  title={Semi-Supervised DNN Training with Word Selection for ASR},
  year=2017,
  booktitle={Proc. Interspeech 2017},
  pages={3687--3691},
  doi={10.21437/Interspeech.2017-1385},
  url={http://dx.doi.org/10.21437/Interspeech.2017-1385}
}