Robust DNN-Based VAD Augmented with Phone Entropy Based Rejection of Background Speech

Yuya Fujita, Ken-ichi Iso


We propose a DNN-based voice activity detector (VAD) augmented by entropy-based frame rejection. DNN-based VAD classifies each frame as speech or non-speech and achieves significantly higher performance than conventional statistical model-based VAD. We observed that many of the remaining errors are false alarms caused by background human speech, such as TV/radio or surrounding people's conversations. To reject such background speech frames, we introduce an entropy-based confidence measure using the phone posterior probabilities output by a DNN-based acoustic model. Compared to the target speaker's voice, background speech tends to have relatively unclear pronunciation or to be contaminated by other noise, so its entropy is larger than that of audio containing only the target speaker's voice. Combining DNN-based VAD with the entropy criterion, we reject frames that the DNN-based VAD classifies as speech but whose entropy exceeds a threshold. We have evaluated the proposed approach and confirmed a greater than 10% reduction in Sentence Error Rate.
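The rejection rule described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the uniform entropy threshold, and the NumPy representation of per-frame phone posteriors are assumptions for the example.

```python
import numpy as np

def reject_high_entropy_frames(speech_mask, phone_posteriors, threshold):
    """Keep only frames the VAD marked as speech whose phone-posterior
    entropy is at or below the threshold.

    speech_mask      : (T,) bool array, True where the DNN-based VAD says "speech"
    phone_posteriors : (T, P) array, per-frame posterior over P phone classes
    threshold        : scalar entropy threshold (nats)
    """
    eps = 1e-10  # guard against log(0)
    # Per-frame entropy H_t = -sum_p p_t(p) * log p_t(p)
    entropy = -np.sum(phone_posteriors * np.log(phone_posteriors + eps), axis=1)
    # Reject speech frames whose entropy exceeds the threshold
    return speech_mask & (entropy <= threshold)
```

A frame with a sharply peaked posterior (clear target speech) has low entropy and survives, while a frame with a flat, uncertain posterior (background speech or noise) is rejected even though the VAD labeled it speech.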


DOI: 10.21437/Interspeech.2016-136

Cite as

Fujita, Y., Iso, K. (2016) Robust DNN-Based VAD Augmented with Phone Entropy Based Rejection of Background Speech. Proc. Interspeech 2016, 3663-3667.

Bibtex
@inproceedings{Fujita+2016,
author={Yuya Fujita and Ken-ichi Iso},
title={Robust DNN-Based VAD Augmented with Phone Entropy Based Rejection of Background Speech},
year=2016,
booktitle={Interspeech 2016},
doi={10.21437/Interspeech.2016-136},
url={http://dx.doi.org/10.21437/Interspeech.2016-136},
pages={3663--3667}
}