ISCA Archive Interspeech 2017

Deep Learning-Based Telephony Speech Recognition in the Wild

Kyu J. Han, Seongjun Hahm, Byung-Hak Kim, Jungsuk Kim, Ian Lane

In this paper, we explore the effectiveness of a variety of Deep Learning-based acoustic models for conversational telephony speech, specifically TDNN, bLSTM and CNN-bLSTM models. We evaluated these models on research testsets, such as Switchboard and CallHome, as well as on recordings from a real-world call-center application. Our best single system, consisting of a single CNN-bLSTM acoustic model, obtained a WER of 5.7% on the Switchboard testset, and in combination with other models a WER of 5.3% was obtained. On the CallHome testset, a WER of 10.1% was achieved with model combination. On the test data collected from real-world call-centers, even with model adaptation using application-specific data, the WER was significantly higher at 15.0%. We performed an error analysis on the real-world data and highlighted the areas where speech recognition still has challenges.
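For readers unfamiliar with the metric quoted above, word error rate (WER) is conventionally the number of word substitutions, deletions and insertions needed to turn the recognizer output into the reference transcript, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the scoring pipeline used in the paper; the function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration, and real evaluations usually apply text normalization first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative only: tokenization is a plain whitespace split, with no
    text normalization of the kind typically applied in ASR scoring.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                       # deletion of a reference word
                curr[j - 1] + 1,                   # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,   # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution (sat -> sit) and one deletion (the) over 6 reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Under this definition, a WER of 5.7% corresponds to roughly 57 word errors per 1,000 reference words.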

doi: 10.21437/Interspeech.2017-1695

Cite as: Han, K.J., Hahm, S., Kim, B.-H., Kim, J., Lane, I. (2017) Deep Learning-Based Telephony Speech Recognition in the Wild. Proc. Interspeech 2017, 1323-1327, doi: 10.21437/Interspeech.2017-1695

  author={Kyu J. Han and Seongjun Hahm and Byung-Hak Kim and Jungsuk Kim and Ian Lane},
  title={{Deep Learning-Based Telephony Speech Recognition in the Wild}},
  booktitle={Proc. Interspeech 2017},