In this paper we discuss the speech activity detection system that we used to detect speech regions in the Dutch TRECVID video collection. The system is designed to filter non-speech audio, such as music or sound effects, out of the signal without the use of predefined non-speech models. Because the system trains its models on-line, it is robust to out-of-domain data. The speech activity error rate on an out-of-domain test set, recordings of English conference meetings, was 4.4%. The overall error rate on twelve randomly selected five-minute TRECVID fragments was 11.5%.
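As a rough illustration of how an error rate of this kind can be computed, the sketch below scores a hypothesized speech/non-speech labelling against a reference, frame by frame. It assumes the error rate is the fraction of frames that are either missed speech or false-alarm speech; the abstract does not state the exact scoring definition used, so this is an assumption. The function name `sad_error_rate` and the frame-label representation are hypothetical.

```python
def sad_error_rate(reference, hypothesis):
    """Return the fraction of frames on which the two labellings disagree.

    Both arguments are equal-length sequences of frame labels, where a
    truthy value means "speech" and a falsy value means "non-speech".
    Assumed definition (not taken from the paper):
        (missed speech frames + false alarm frames) / total frames
    """
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have the same length")
    if len(reference) == 0:
        raise ValueError("at least one frame is required")

    # Speech in the reference that the system labelled as non-speech.
    missed = sum(1 for r, h in zip(reference, hypothesis) if r and not h)
    # Non-speech in the reference that the system labelled as speech.
    false_alarm = sum(1 for r, h in zip(reference, hypothesis) if h and not r)

    return (missed + false_alarm) / len(reference)


if __name__ == "__main__":
    # Toy labels, not data from the paper: 1 = speech, 0 = non-speech.
    ref = [1, 1, 1, 0, 0, 0, 1, 1, 0, 0]
    hyp = [1, 1, 0, 0, 0, 1, 1, 1, 0, 0]
    print(f"error rate: {sad_error_rate(ref, hyp):.1%}")
```

With the toy labels above, one frame is missed speech and one is a false alarm out of ten frames in total.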
Bibliographic reference. Huijbregts, Marijn / Wooters, Chuck / Ordelman, Roeland (2007): "Filtering the unknown: speech activity detection in heterogeneous video collections", In INTERSPEECH-2007, 2925-2928.