ISCA Archive Interspeech 2013

Speech activity detection on YouTube using deep neural networks

Neville Ryant, Mark Liberman, Jiahong Yuan

Speech activity detection (SAD) is an important first step in speech processing. Commonly used methods (e.g., frame-level classification using Gaussian mixture models (GMMs)) work well under stationary noise conditions, but do not generalize well to domains such as YouTube, where videos may exhibit a diverse range of environmental conditions. One solution is to augment the conventional cepstral features with additional, hand-engineered features (e.g., spectral flux, spectral centroid, multiband spectral entropies) which are robust to changes in environment and recording condition. An alternative approach, explored here, is to learn robust features during the course of training using an appropriate architecture such as deep neural networks (DNNs). In this paper we demonstrate that a DNN with input consisting of multiple frames of mel frequency cepstral coefficients (MFCCs) yields drastically lower frame-wise error rates (19.6%) on YouTube videos compared to a conventional GMM-based system (40%).
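The sketch below illustrates the general idea described in the abstract: a feed-forward DNN that classifies each frame as speech or non-speech from a spliced window of MFCC frames. It is a minimal illustration only; the layer sizes, activation functions, context width, and number of cepstral coefficients shown here are assumptions for the example, not the configuration reported in the paper.

# Minimal sketch (PyTorch) of a frame-level DNN speech activity detector
# operating on stacked MFCC frames. All hyperparameters below are
# illustrative assumptions, not the paper's settings.
import torch
import torch.nn as nn
import torch.nn.functional as F

N_MFCC = 13          # assumed: 13 cepstral coefficients per frame
CONTEXT = 5          # assumed: +/-5 frames of context around the centre frame
INPUT_DIM = N_MFCC * (2 * CONTEXT + 1)

class SADNet(nn.Module):
    """Feed-forward DNN mapping a window of MFCC frames to speech/non-speech logits."""
    def __init__(self, input_dim=INPUT_DIM, hidden_dim=256, n_hidden=3):
        super().__init__()
        layers, dim = [], input_dim
        for _ in range(n_hidden):
            layers += [nn.Linear(dim, hidden_dim), nn.Sigmoid()]
            dim = hidden_dim
        layers += [nn.Linear(dim, 2)]  # two classes: non-speech, speech
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)  # raw logits; apply softmax for posteriors

def stack_frames(mfcc, context=CONTEXT):
    """Splice each frame with +/-context neighbouring frames (edges zero-padded)."""
    n_frames, _ = mfcc.shape
    padded = F.pad(mfcc.T, (context, context)).T          # pad along the time axis
    windows = [padded[i:i + n_frames] for i in range(2 * context + 1)]
    return torch.cat(windows, dim=1)                      # (n_frames, INPUT_DIM)

# Usage: mfcc is an (n_frames, N_MFCC) tensor of per-frame features.
model = SADNet()
mfcc = torch.randn(1000, N_MFCC)
posteriors = torch.softmax(model(stack_frames(mfcc)), dim=1)[:, 1]  # P(speech) per frame
speech_frames = posteriors > 0.5

In practice such a network would be trained with a cross-entropy loss against frame-level speech/non-speech labels, and the per-frame decisions smoothed before segmentation; those steps are omitted here for brevity.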


doi: 10.21437/Interspeech.2013-203

Cite as: Ryant, N., Liberman, M., Yuan, J. (2013) Speech activity detection on YouTube using deep neural networks. Proc. Interspeech 2013, 728-731, doi: 10.21437/Interspeech.2013-203

@inproceedings{ryant13_interspeech,
  author={Neville Ryant and Mark Liberman and Jiahong Yuan},
  title={{Speech activity detection on YouTube using deep neural networks}},
  year=2013,
  booktitle={Proc. Interspeech 2013},
  pages={728--731},
  doi={10.21437/Interspeech.2013-203}
}