We present the IBM speech activity detection system that was fielded in the phase 2 evaluation of the DARPA RATS (robust automatic transcription of speech) program. Key ingredients of the system are: multi-pass HMM Viterbi segmentation, fusion of multiple feature streams, file-based and speech-based normalization schemes, the use of regular and convolutional deep neural networks, and model fusion through frame-level score combination of channel-dependent models. These techniques were instrumental in achieving a 1.4% equal error rate on the RATS phase 2 evaluation data.
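To make two of the listed ingredients concrete, the following is a minimal sketch of frame-level score combination followed by a two-state Viterbi smoothing pass. It is not the system described in the paper: the function names `fuse_scores` and `viterbi_segment`, the equal fusion weights, the log-likelihood-ratio inputs, and the `switch_penalty` value are all assumptions chosen for illustration, and the single pass shown here stands in for the multi-pass HMM segmentation only loosely.

```python
def fuse_scores(model_scores, weights=None):
    """Weighted average of per-frame speech scores from several models.

    model_scores: list of equal-length lists, one list per model
    (hypothetical per-frame log-likelihood ratios, speech vs. non-speech).
    """
    n_models = len(model_scores)
    if weights is None:
        # Assumption: equal weights when none are supplied.
        weights = [1.0 / n_models] * n_models
    n_frames = len(model_scores[0])
    return [
        sum(w * scores[t] for w, scores in zip(weights, model_scores))
        for t in range(n_frames)
    ]


def viterbi_segment(llr, switch_penalty=2.0):
    """Two-state Viterbi decoding (0 = non-speech, 1 = speech).

    Each frame contributes +llr/2 to the speech state and -llr/2 to the
    non-speech state; changing state costs switch_penalty, which
    discourages very short segments.
    """
    if not llr:
        return []

    def emit(t, state):
        return 0.5 * llr[t] if state == 1 else -0.5 * llr[t]

    best = [emit(0, 0), emit(0, 1)]
    backpointers = []
    for t in range(1, len(llr)):
        new_best = [0.0, 0.0]
        ptr = [0, 0]
        for state in (0, 1):
            stay = best[state]
            switch = best[1 - state] - switch_penalty
            if stay >= switch:
                ptr[state], prev = state, stay
            else:
                ptr[state], prev = 1 - state, switch
            new_best[state] = prev + emit(t, state)
        best = new_best
        backpointers.append(ptr)

    # Trace back from the better final state.
    state = 1 if best[1] > best[0] else 0
    path = [state]
    for ptr in reversed(backpointers):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path


# Two hypothetical channel-dependent models scoring the same 8 frames.
model_a = [-3.0, -1.0, 4.0, -2.0, 2.0, 4.0, -1.0, -3.0]
model_b = [-1.0, -3.0, 2.0, 0.0, 4.0, 2.0, -3.0, -1.0]
fused = fuse_scores([model_a, model_b])
labels = viterbi_segment(fused, switch_penalty=2.0)
```

In this toy input, thresholding the fused scores frame by frame would mark the fourth frame as non-speech, whereas the Viterbi pass keeps it inside the surrounding speech segment because two extra state switches cost more than the weak negative score gains.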
Bibliographic reference. Saon, George / Thomas, Samuel / Soltau, Hagen / Ganapathy, Sriram / Kingsbury, Brian (2013): "The IBM speech activity detection system for the DARPA RATS program", In INTERSPEECH-2013, 3497-3501.