The IBM 2016 English Conversational Telephone Speech Recognition System

George Saon, Tom Sercu, Steven Rennie, Hong-Kwang J. Kuo


We describe a collection of acoustic and language modeling techniques that lowered the word error rate of our English conversational telephone LVCSR system to a record 6.6% on the Switchboard subset of the Hub5 2000 evaluation test set. On the acoustic side, we use a score fusion of three strong models: recurrent nets with maxout activations, very deep convolutional nets with 3×3 kernels, and bidirectional long short-term memory nets which operate on FMLLR and i-vector features. On the language modeling side, we use an updated model “M” and hierarchical neural network LMs.
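
As a rough illustration of what a score fusion of three acoustic models can look like, the sketch below combines per-frame log-scores from three models with a weighted sum. This is a minimal sketch under stated assumptions, not the system's actual implementation: the abstract does not specify the fusion rule, so the weighted log-linear form, the equal weights, the three-state toy outputs, and the names `log_softmax` and `fuse_frame_scores` are all hypothetical.

```python
import math


def log_softmax(scores):
    """Convert raw per-state scores into log-probabilities (numerically stable)."""
    m = max(scores)
    lse = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - lse for s in scores]


def fuse_frame_scores(model_scores, weights):
    """Weighted sum of per-state log-scores from several models for one frame.

    model_scores: one list of log-scores per model, all of equal length.
    weights: one non-negative weight per model (assumed; not from the paper).
    """
    if len(model_scores) != len(weights):
        raise ValueError("need exactly one weight per model")
    n_states = len(model_scores[0])
    if any(len(s) != n_states for s in model_scores):
        raise ValueError("all models must score the same set of states")
    return [
        sum(w * s[k] for w, s in zip(weights, model_scores))
        for k in range(n_states)
    ]


# Toy outputs for a single frame over three states (made-up numbers).
rnn_scores = log_softmax([2.0, 1.0, 0.1])    # stands in for the maxout RNN
cnn_scores = log_softmax([1.5, 1.7, 0.2])    # stands in for the very deep CNN
lstm_scores = log_softmax([2.2, 0.9, 0.3])   # stands in for the bidirectional LSTM

# Equal weights are an assumption; in practice they would be tuned on held-out data.
weights = [1.0 / 3, 1.0 / 3, 1.0 / 3]
fused = fuse_frame_scores([rnn_scores, cnn_scores, lstm_scores], weights)
best_state = fused.index(max(fused))
```

In this toy frame the CNN alone prefers the second state, while the fused scores follow the two models that prefer the first, which is the kind of disagreement a fusion is meant to smooth out.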


DOI: 10.21437/Interspeech.2016-1460

Cite as

Saon, G., Sercu, T., Rennie, S., Kuo, H.J. (2016) The IBM 2016 English Conversational Telephone Speech Recognition System. Proc. Interspeech 2016, 7-11.

Bibtex
@inproceedings{Saon+2016,
author={George Saon and Tom Sercu and Steven Rennie and Hong-Kwang J. Kuo},
title={The IBM 2016 English Conversational Telephone Speech Recognition System},
year=2016,
booktitle={Interspeech 2016},
doi={10.21437/Interspeech.2016-1460},
url={http://dx.doi.org/10.21437/Interspeech.2016-1460},
pages={7--11}
}