7th International Conference on Spoken Language Processing
September 16-20, 2002
We present three voice activity detection (VAD) algorithms that are suitable for the off-line processing of noisy speech and compare their performance on SPINE-2 evaluation data, using speech recognition error rate as the quality metric. One VAD system is a simple HMM-based segmenter that uses normalized log-energy and a degree-of-voicing measure as raw features. The other two VAD systems focus on frequency-localized temporal information in the speech signal using a TempoRAl PatternS (TRAPS) classifier. They differ only in how the TRAPS output is processed: one uses median filtering to generate segment hypotheses, while the other is a hybrid system that uses a Viterbi search identical to that used in the HMM segmenter. Recognition on the hybrid HMM/TRAPS segmentation is more accurate than recognition on the other two segmentations by 1% absolute. This difference is statistically significant at a 99% confidence level according to a matched pairs sentence-segment word error test.
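The two ways of post-processing a frame-level classifier output can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract gives no window length, threshold, HMM topology, or transition probabilities, so `width`, `threshold`, and `stay_prob` below are hypothetical parameters, and the two-state model with posteriors used directly as emission scores is an assumed simplification. The input is a list of per-frame speech posteriors, such as a TRAPS-style classifier might produce.

```python
import math
from statistics import median


def median_smooth(posteriors, width=5, threshold=0.5):
    """Label each frame speech (1) or non-speech (0) by thresholding
    the median of the posteriors in a window centred on the frame.
    `width` and `threshold` are assumed values, not from the paper."""
    half = width // 2
    n = len(posteriors)
    labels = []
    for t in range(n):
        # The window is truncated at the edges of the sequence.
        window = posteriors[max(0, t - half):min(n, t + half + 1)]
        labels.append(1 if median(window) > threshold else 0)
    return labels


def viterbi_smooth(posteriors, stay_prob=0.9, eps=1e-6):
    """Decode the best path through a two-state (0 = non-speech,
    1 = speech) model with a self-loop probability of `stay_prob`.
    Posteriors serve as emission scores; this is an assumption."""
    if not posteriors:
        return []
    log_stay = math.log(stay_prob)
    log_switch = math.log(1.0 - stay_prob)

    def emit(p):
        # Clamp to avoid log(0), then return (non-speech, speech) scores.
        p = min(max(p, eps), 1.0 - eps)
        return (math.log(1.0 - p), math.log(p))

    score = list(emit(posteriors[0]))  # uniform prior over initial state
    back = []
    for p in posteriors[1:]:
        e = emit(p)
        new_score, ptr = [], []
        for s in (0, 1):
            stay = score[s] + log_stay
            switch = score[1 - s] + log_switch
            if stay >= switch:
                new_score.append(stay + e[s])
                ptr.append(s)
            else:
                new_score.append(switch + e[s])
                ptr.append(1 - s)
        score = new_score
        back.append(ptr)

    # Trace back from the better-scoring final state.
    state = 0 if score[0] >= score[1] else 1
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path
```

Calling `median_smooth(p, width=3)` and `viterbi_smooth(p)` on the same posterior sequence `p` returns two label sequences that can be compared frame by frame. The median filter decides each frame from a fixed local window, whereas the Viterbi search picks the single best label sequence for the whole utterance, with `stay_prob` setting how strongly state changes are penalized.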
Bibliographic reference. Kingsbury, Brian / Jain, Pratibha / Adami, Andre (2002): "A hybrid HMM/TRAPS model for robust voice activity detection", in ICSLP-2002, 1073-1076.