English Conversational Telephone Speech Recognition by Humans and Machines

George Saon, Gakuto Kurata, Tom Sercu, Kartik Audhkhasi, Samuel Thomas, Dimitrios Dimitriadis, Xiaodong Cui, Bhuvana Ramabhadran, Michael Picheny, Lynn-Li Lim, Bergul Roomi, Phil Hall


Word error rates on the Switchboard conversational corpus that just a few years ago were 14% have dropped to 8.0%, then 6.6% and most recently 5.8%, and are now believed to be within striking range of human performance. This then raises two issues: what is human performance, and how far down can we still drive speech recognition error rates? In trying to assess human performance, we performed an independent set of measurements on the Switchboard and CallHome subsets of the Hub5 2000 evaluation and found that human accuracy may be considerably better than what was earlier reported, giving the community a significantly harder goal to achieve. We also report on our own efforts in this area, presenting a set of acoustic and language modeling techniques that lowered the WER of our system to 5.5%/10.3% on these subsets, which is a new performance milestone (albeit not at what we measure to be human performance). On the acoustic side, we use a score fusion of one LSTM with multiple feature inputs, a second LSTM trained with speaker-adversarial multi-task learning and a third convolutional residual net (ResNet). On the language modeling side, we use word and character LSTMs and convolutional WaveNet-style language models.


 DOI: 10.21437/Interspeech.2017-405

Cite as: Saon, G., Kurata, G., Sercu, T., Audhkhasi, K., Thomas, S., Dimitriadis, D., Cui, X., Ramabhadran, B., Picheny, M., Lim, L., Roomi, B., Hall, P. (2017) English Conversational Telephone Speech Recognition by Humans and Machines. Proc. Interspeech 2017, 132-136, DOI: 10.21437/Interspeech.2017-405.


@inproceedings{Saon2017,
  author={George Saon and Gakuto Kurata and Tom Sercu and Kartik Audhkhasi and Samuel Thomas and Dimitrios Dimitriadis and Xiaodong Cui and Bhuvana Ramabhadran and Michael Picheny and Lynn-Li Lim and Bergul Roomi and Phil Hall},
  title={English Conversational Telephone Speech Recognition by Humans and Machines},
  year=2017,
  booktitle={Proc. Interspeech 2017},
  pages={132--136},
  doi={10.21437/Interspeech.2017-405},
  url={http://dx.doi.org/10.21437/Interspeech.2017-405}
}