Artificial Neural Network-Based Feature Combination for Spatial Voice Activity Detection

Stefan Meier, Walter Kellermann


For many applications in speech communications and speech-based human-machine interaction, reliable Voice Activity Detection (VAD) is crucial. Conventional VAD methods typically differentiate between a target speaker and background noise by exploiting characteristic properties of speech signals. If a target speaker is to be distinguished from other speech sources, these conventional concepts are no longer applicable, and other methods, typically exploiting the spatial diversity of the individual sources, are required. Often, it is beneficial to combine several features in order to improve the overall decision. Optimum feature combinations, however, depend strongly on the scenario, especially on the position of the target source, the characteristics of noise and interference, and the Signal-to-Interference Ratio (SIR). Moreover, choosing detection thresholds that remain robust under changing scenarios is often difficult. In this paper, these issues are addressed by introducing Artificial Neural Networks (ANNs) for spatial voice activity detection, which make it possible to combine several features with background information. The experimental results show that even small ANNs can significantly and robustly improve the detection rates, offering a valuable tool for VAD.
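To illustrate the general idea of ANN-based feature combination for frame-wise VAD, the following is a minimal sketch, not the paper's actual architecture or features: it trains a tiny feed-forward network on two synthetic "spatial" features per frame (both the feature values and the 2-4-1 network size are assumptions made here for illustration) and thresholds the network output to obtain a speech/non-speech decision.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic two-feature frames (hypothetical spatial features, e.g. a
# direction-of-arrival cue and a coherence measure): target-speech frames
# cluster around (1, 1), noise/interference frames around (-1, -1).
n = 500
speech = rng.normal(loc=[1.0, 1.0], scale=0.5, size=(n, 2))
noise = rng.normal(loc=[-1.0, -1.0], scale=0.5, size=(n, 2))
X = np.vstack([speech, noise])
y = np.concatenate([np.ones(n), np.zeros(n)])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Small 2-4-1 network, trained with plain gradient descent on cross-entropy.
W1 = rng.normal(scale=0.5, size=(2, 4))
b1 = np.zeros(4)
W2 = rng.normal(scale=0.5, size=(4, 1))
b2 = np.zeros(1)

lr = 0.5
for _ in range(300):
    h = sigmoid(X @ W1 + b1)           # hidden-layer activations
    p = sigmoid(h @ W2 + b2).ravel()   # speech-presence probability
    # Backpropagate the mean cross-entropy gradient through both layers.
    dz2 = (p - y)[:, None] / len(y)
    dW2 = h.T @ dz2
    db2 = dz2.sum(axis=0)
    dh = dz2 @ W2.T * h * (1 - h)
    dW1 = X.T @ dh
    db1 = dh.sum(axis=0)
    W2 -= lr * dW2; b2 -= lr * db2
    W1 -= lr * dW1; b1 -= lr * db1

# Frame-wise VAD decision: threshold the combined feature score at 0.5.
pred = (sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2).ravel() > 0.5).astype(float)
accuracy = (pred == y).mean()
```

On these well-separated synthetic clusters the learned decision boundary separates nearly all frames; the point of the sketch is simply that the network learns both the feature weighting and the effective detection threshold jointly, rather than requiring hand-tuned per-feature thresholds.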


DOI: 10.21437/Interspeech.2016-1184

Cite as

Meier, S., Kellermann, W. (2016) Artificial Neural Network-Based Feature Combination for Spatial Voice Activity Detection. Proc. Interspeech 2016, 2987-2991.

Bibtex
@inproceedings{Meier+2016,
author={Stefan Meier and Walter Kellermann},
title={Artificial Neural Network-Based Feature Combination for Spatial Voice Activity Detection},
year=2016,
booktitle={Interspeech 2016},
doi={10.21437/Interspeech.2016-1184},
url={http://dx.doi.org/10.21437/Interspeech.2016-1184},
pages={2987--2991}
}