Accurate voice activity detection (VAD) is important for robust automatic speech recognition (ASR) systems. This paper proposes a noise-robust VAD method that uses long-term temporal information in speech. Long-term temporal information has recently attracted attention in ASR, but has not been sufficiently investigated for VAD. This paper describes an attempt to incorporate long-term temporal information into a feature parameter set by extracting conventional dynamic features from long-term cepstrum sequences. By using long-term features, the proposed method captures the temporal context of phonemes and makes it easier to distinguish speech from non-speech intervals. The long-term features, calculated over the average phoneme duration, provide noise robustness. In an experiment on a Japanese digit corpus, the proposed method yielded considerable improvements over conventional methods, including G.729 Annex B and the ETSI AFE-VAD, under low-SNR conditions, and achieved a 71.1% average error reduction compared with the ETSI AFE-VAD.
Bibliographic reference. Fukuda, Takashi / Ichikawa, Osamu / Nishimura, Masafumi (2008): "Phone-duration-dependent long-term dynamic features for a stochastic model-based voice activity detection", In INTERSPEECH-2008, 1293-1296.
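The core feature-extraction idea in the abstract can be illustrated with a minimal sketch: applying the conventional delta (dynamic-feature) regression formula to a cepstrum sequence, but with a window half-width `half_window` chosen so that the regression spans roughly the average phoneme duration rather than the usual two or three frames. The function name, the padding choice, and the specific window setting below are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def long_term_delta(cepstra, half_window):
    """First-order dynamic (delta) features via the standard regression
    formula, computed over a long temporal window.

    cepstra:     (T, D) array of cepstral frames.
    half_window: regression half-width M; with a 10 ms frame shift,
                 a larger M (e.g. 5-8) spans roughly an average phoneme
                 duration (illustrative assumption, not the paper's value).
    """
    T, D = cepstra.shape
    # Replicate edge frames so every frame has a full regression window.
    padded = np.pad(cepstra, ((half_window, half_window), (0, 0)), mode="edge")
    num = np.zeros((T, D))
    for m in range(1, half_window + 1):
        # Weighted difference of frames m steps ahead and m steps behind.
        num += m * (padded[half_window + m : half_window + m + T]
                    - padded[half_window - m : half_window - m + T])
    denom = 2.0 * sum(m * m for m in range(1, half_window + 1))
    return num / denom
```

With a small `half_window` this reduces to the usual short-term delta cepstrum; enlarging the window is what folds longer-range temporal context into the same feature set.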