Exploiting Untranscribed Broadcast Data for Improved Code-Switching Detection

Emre Yılmaz, Henk van den Heuvel, David Van Leeuwen


We have recently presented an automatic speech recognition (ASR) system operating on Frisian-Dutch code-switched speech. This type of speech requires careful handling of unexpected language switches that may occur within a single utterance. In this paper, we extend this work by using untranscribed broadcast data to improve multilingually trained deep neural networks (DNNs) that were trained on 11.5 hours of manually annotated bilingual speech. For this purpose, we apply the initial ASR system to the untranscribed broadcast data and automatically create transcriptions from the recognizer output, using different language models for rescoring. We then train new acoustic models on the combined data, i.e., the manually and automatically transcribed bilingual broadcast data, and assess the quality of the automatic transcriptions through the recognition accuracy obtained on separate development and test sets. Finally, we report code-switching detection performance and discuss how it correlates with ASR performance.


DOI: 10.21437/Interspeech.2017-391

Cite as: Yılmaz, E., van den Heuvel, H., Van Leeuwen, D. (2017) Exploiting Untranscribed Broadcast Data for Improved Code-Switching Detection. Proc. Interspeech 2017, 42-46, DOI: 10.21437/Interspeech.2017-391.


@inproceedings{Yılmaz2017,
  author={Emre Yılmaz and Henk van den Heuvel and David Van Leeuwen},
  title={Exploiting Untranscribed Broadcast Data for Improved Code-Switching Detection},
  year=2017,
  booktitle={Proc. Interspeech 2017},
  pages={42--46},
  doi={10.21437/Interspeech.2017-391},
  url={http://dx.doi.org/10.21437/Interspeech.2017-391}
}