Acoustic Model Bootstrapping Using Semi-Supervised Learning

Langzhou Chen, Volker Leutnant


This work aims at bootstrapping acoustic model training for automatic speech recognition with small amounts of human-labeled speech data and large amounts of machine-labeled speech data. Semi-supervised learning is investigated to select machine-transcribed training samples. Two semi-supervised learning methods are proposed: one is the local-global uncertainty based method, which introduces both the local uncertainty of the current utterance and the global uncertainty over the whole data pool into the data selection; the other is margin based data selection, which selects utterances close to the decision boundary through language model tuning. Experimental results on a Japanese far-field automatic speech recognition system indicate that the acoustic model trained on automatically transcribed speech data achieves about 17% relative gain when no in-domain human-annotated data is available for initialization, and a 3.7% relative gain when the initial acoustic model is trained on a small amount of in-domain data.
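
To make the data-selection idea concrete, the following is a minimal sketch of uncertainty-driven utterance selection for semi-supervised training. It is not the paper's implementation: the data structures, the entropy-based confidence score, and the weighting parameter `alpha` are illustrative assumptions used only to show how a local per-utterance uncertainty and a global pool-level statistic might be combined to rank machine-transcribed utterances.

```python
# Illustrative sketch of combining local and global uncertainty for
# selecting machine-transcribed utterances; all names and thresholds
# are assumptions, not the method described in the paper.
from dataclasses import dataclass
from typing import List
import math


@dataclass
class Utterance:
    utt_id: str
    hypothesis: str                 # machine transcript from the seed recognizer
    word_confidences: List[float]   # per-word posterior confidences in [0, 1]


def local_uncertainty(utt: Utterance) -> float:
    """Average binary entropy of the per-word confidences (local view)."""
    def entropy(p: float) -> float:
        p = min(max(p, 1e-6), 1.0 - 1e-6)
        return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))
    return sum(entropy(c) for c in utt.word_confidences) / len(utt.word_confidences)


def global_uncertainty(utt: Utterance, pool_mean_confidence: float) -> float:
    """Distance of this utterance's mean confidence from the pool average (global view)."""
    utt_mean = sum(utt.word_confidences) / len(utt.word_confidences)
    return abs(utt_mean - pool_mean_confidence)


def select_utterances(pool: List[Utterance], budget: int, alpha: float = 0.5) -> List[Utterance]:
    """Rank by a weighted sum of local and global uncertainty and keep the
    `budget` highest-scoring utterances for semi-supervised training."""
    pool_mean = sum(
        sum(u.word_confidences) / len(u.word_confidences) for u in pool
    ) / len(pool)
    scored = [
        (alpha * local_uncertainty(u) + (1.0 - alpha) * global_uncertainty(u, pool_mean), u)
        for u in pool
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [u for _, u in scored[:budget]]
```

In this sketch, `alpha` trades off the two uncertainty sources; the margin based selection mentioned in the abstract would instead rank utterances by how close competing hypotheses are under a tuned language model, which is not shown here.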


 DOI: 10.21437/Interspeech.2019-2818

Cite as: Chen, L., Leutnant, V. (2019) Acoustic Model Bootstrapping Using Semi-Supervised Learning. Proc. Interspeech 2019, 3198-3202, DOI: 10.21437/Interspeech.2019-2818.


@inproceedings{Chen2019,
  author={Langzhou Chen and Volker Leutnant},
  title={{Acoustic Model Bootstrapping Using Semi-Supervised Learning}},
  year=2019,
  booktitle={Proc. Interspeech 2019},
  pages={3198--3202},
  doi={10.21437/Interspeech.2019-2818},
  url={http://dx.doi.org/10.21437/Interspeech.2019-2818}
}