We propose a novel technique for conducting robust voice activity detection (VAD) in high-noise recordings. We use Gaussian mixture modelling (GMM) to train two generic models: speech and non-speech. We then score smaller segments of a given (unseen) recording against each of these GMMs to obtain two respective likelihood scores for each segment. These scores are used to compute a dissimilarity measure between pairs of segments and to carry out complete-linkage clustering of the segments into speech and non-speech clusters. We compare the accuracy of our method against state-of-the-art and standardised VAD techniques to demonstrate an absolute improvement of 15% in half-total error rate (HTER) over the best-performing baseline system across the QUT-NOISE-TIMIT database. We then apply our approach to the Audio-Visual Database of American English (AVDBAE) to demonstrate the performance of our algorithm using visual, audio-visual, or a proposed fusion of these features.
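The pipeline in the abstract (train speech and non-speech models, score each segment against both, then complete-linkage cluster the segments in the resulting likelihood space) can be sketched as follows. This is an illustrative toy, not the authors' implementation: it substitutes single diagonal Gaussians for the generic GMMs, uses synthetic 4-D features in place of real acoustic features, and assumes a Euclidean dissimilarity between score pairs, whereas the paper defines its own measure.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_gaussian(X):
    """Fit a single diagonal Gaussian (a one-component stand-in for a generic GMM)."""
    return X.mean(axis=0), X.var(axis=0) + 1e-6

def log_likelihood(X, mean, var):
    """Average per-frame log-likelihood of frames X under a diagonal Gaussian."""
    ll = -0.5 * (np.log(2 * np.pi * var) + (X - mean) ** 2 / var)
    return ll.sum(axis=1).mean()

def complete_linkage(points, n_clusters):
    """Naive agglomerative clustering with the complete-link (max-distance) merge rule."""
    clusters = [[i] for i in range(len(points))]
    d = np.linalg.norm(points[:, None] - points[None, :], axis=-1)
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Complete linkage: distance between clusters = farthest pair.
                link = max(d[i, j] for i in clusters[a] for j in clusters[b])
                if best is None or link < best[0]:
                    best = (link, a, b)
        _, a, b = best
        clusters[a] += clusters.pop(b)  # merge the closest pair of clusters
    return clusters

# Toy 4-D "features": speech-like frames centred at +2, non-speech at -2.
speech_model = fit_gaussian(rng.normal(2.0, 0.5, size=(400, 4)))
nonspeech_model = fit_gaussian(rng.normal(-2.0, 0.5, size=(400, 4)))

# Segment an unseen recording (first 5 segments speech-like, last 5 not)
# and score every segment against both generic models.
segments = [rng.normal(m, 0.5, size=(50, 4)) for m in [2.0] * 5 + [-2.0] * 5]
scores = np.array([[log_likelihood(s, *speech_model),
                    log_likelihood(s, *nonspeech_model)] for s in segments])

# Cluster segments into two groups in the 2-D likelihood space, then label
# the cluster with the higher mean speech-model likelihood as "speech".
c1, c2 = complete_linkage(scores, n_clusters=2)
speech = c1 if scores[c1, 0].mean() > scores[c2, 0].mean() else c2
print(sorted(speech))
```

With this synthetic setup the clustering recovers the first five segments as the speech cluster; on real audio the feature extraction, GMM training, and dissimilarity measure would all follow the paper's configuration.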
Cite as: Ghaemmaghami, H., Dean, D., Kalantari, S., Sridharan, S., Fookes, C. (2015) Complete-linkage clustering for voice activity detection in audio and visual speech. Proc. Interspeech 2015, 2292-2296, doi: 10.21437/Interspeech.2015-444
@inproceedings{ghaemmaghami15_interspeech,
  author={Houman Ghaemmaghami and David Dean and Shahram Kalantari and Sridha Sridharan and Clinton Fookes},
  title={{Complete-linkage clustering for voice activity detection in audio and visual speech}},
  year=2015,
  booktitle={Proc. Interspeech 2015},
  pages={2292--2296},
  doi={10.21437/Interspeech.2015-444}
}