CRIM's System for the MGB-3 English Multi-Genre Broadcast Media Transcription

Vishwa Gupta, Gilles Boulianne


The second English Multi-Genre Broadcast Challenge (MGB-3) is a controlled evaluation of speech recognition and lightly supervised alignment using BBC TV recordings. CRIM is participating in the speech recognition part of the challenge, and this paper presents CRIM's contributions to the MGB-3 transcription task. This task is inherently more difficult than the first task because the training audio has been reduced from 1200 hours to 500 hours. CRIM's main contributions are experiments with bidirectional LSTM models and lattice-free MMI (LF-MMI) trained TDNN models for acoustic modeling; LSTM and DNN models for speech/non-speech detection (VAD), whose output is the input to speaker diarization; and LSTM language models for rescoring lattices. We also show that adding senone posteriors to the input of the LSTM and DNN VAD models reduces the error rate. CRIM's best single-decoding WER on the MGB-3 dev17 development set (with reference segmentation) went down from 27.6% (with our MGB-1 challenge system) to 24.1% for this task using the LF-MMI trained TDNN models. The final WER (after VAD) is 20.9% on the dev17 set and 20.8% on the new dev18 development set.
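
To put the reported single-decoding improvement in perspective, the drop from 27.6% to 24.1% WER can be expressed as both an absolute and a relative reduction. The short sketch below does only that arithmetic; the function name `relative_wer_reduction` is a hypothetical helper introduced for illustration and is not part of the paper or of any toolkit.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction in percent.

    Hypothetical helper for illustration: both arguments are WERs
    expressed in percent, and baseline_wer must be positive.
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# WERs quoted in the abstract for dev17 with reference segmentation.
mgb1_system_wer = 27.6   # MGB-1 challenge system
lf_mmi_tdnn_wer = 24.1   # LF-MMI trained TDNN models

absolute = mgb1_system_wer - lf_mmi_tdnn_wer
relative = relative_wer_reduction(mgb1_system_wer, lf_mmi_tdnn_wer)

print(f"absolute reduction: {absolute:.1f} points")
print(f"relative reduction: {relative:.1f}%")
```

Under these figures the improvement is 3.5 points absolute, or roughly 12.7% relative to the MGB-1 system.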


DOI: 10.21437/Interspeech.2018-2079

Cite as: Gupta, V., Boulianne, G. (2018) CRIM's System for the MGB-3 English Multi-Genre Broadcast Media Transcription. Proc. Interspeech 2018, 2653-2657, DOI: 10.21437/Interspeech.2018-2079.


@inproceedings{Gupta2018,
  author={Vishwa Gupta and Gilles Boulianne},
  title={CRIM's System for the MGB-3 English Multi-Genre Broadcast Media Transcription},
  year=2018,
  booktitle={Proc. Interspeech 2018},
  pages={2653--2657},
  doi={10.21437/Interspeech.2018-2079},
  url={http://dx.doi.org/10.21437/Interspeech.2018-2079}
}