ISCA Archive Interspeech 2008

Phone-duration-dependent long-term dynamic features for a stochastic model-based voice activity detection

Takashi Fukuda, Osamu Ichikawa, Masafumi Nishimura

Accurate voice activity detection (VAD) is important for robust automatic speech recognition (ASR) systems. This paper proposes a noise-robust VAD method that uses long-term temporal information in speech. Long-term temporal information has recently been a focus of ASR research, but it has not been sufficiently investigated for VAD. This paper describes an attempt to incorporate long-term temporal information into the feature parameter set by extracting conventional dynamic features from long-term cepstrum sequences. By using long-term features, the proposed method captures the temporal context of phonemes and can distinguish between speech and non-speech intervals. Long-term features calculated over the average phoneme duration provide noise robustness. In an experiment on a Japanese digit corpus, the proposed method yielded considerable improvements over conventional methods, including G.729 Annex B and the ETSI AFE-VAD, under low-SNR conditions, and achieved an average error reduction of 71.1% relative to the ETSI AFE-VAD.
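The idea of extracting conventional dynamic features from a long-term cepstrum sequence can be illustrated with a minimal sketch. It uses the standard regression-based delta formula with a wider window than usual. The function name `delta_features`, the edge padding, and the window lengths in the usage lines are assumptions made for illustration; the abstract does not state the window length or the exact formula used in the paper.

```python
import numpy as np


def delta_features(cepstra, half_window):
    """Regression-based delta features over +/- half_window frames.

    cepstra: array of shape (num_frames, num_coeffs).
    half_window: number of frames on each side of the current frame
        (hypothetical parameter; a larger value gives a longer-term feature).
    """
    cepstra = np.asarray(cepstra, dtype=float)
    num_frames = cepstra.shape[0]

    # Repeat the first and last frames so every frame has full context
    # (an assumption; other boundary treatments are possible).
    padded = np.pad(cepstra, ((half_window, half_window), (0, 0)), mode="edge")

    # Normalizer of the standard delta regression formula.
    denom = 2.0 * sum(k * k for k in range(1, half_window + 1))

    delta = np.zeros_like(cepstra)
    for k in range(1, half_window + 1):
        ahead = padded[half_window + k : half_window + k + num_frames]
        behind = padded[half_window - k : half_window - k + num_frames]
        delta += k * (ahead - behind)
    return delta / denom


# Usage: stack short-term and long-term deltas onto placeholder cepstra.
# The window sizes 2 and 8 are illustrative values, not taken from the paper.
cepstra = np.random.default_rng(0).normal(size=(100, 13))
features = np.hstack([
    cepstra,
    delta_features(cepstra, 2),
    delta_features(cepstra, 8),
])
```

With this formula, a cepstral trajectory that rises linearly produces a delta equal to its slope at every frame with full context, and a constant trajectory produces zeros. Widening the window averages the slope estimate over more frames, which is one plausible reading of why a window near the average phoneme duration would be less sensitive to frame-level noise.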


doi: 10.21437/Interspeech.2008-311

Cite as: Fukuda, T., Ichikawa, O., Nishimura, M. (2008) Phone-duration-dependent long-term dynamic features for a stochastic model-based voice activity detection. Proc. Interspeech 2008, 1293-1296, doi: 10.21437/Interspeech.2008-311

@inproceedings{fukuda08_interspeech,
  author={Takashi Fukuda and Osamu Ichikawa and Masafumi Nishimura},
  title={{Phone-duration-dependent long-term dynamic features for a stochastic model-based voice activity detection}},
  year=2008,
  booktitle={Proc. Interspeech 2008},
  pages={1293--1296},
  doi={10.21437/Interspeech.2008-311}
}