Speech activity detection (SAD) is a conceptually simple task that still poses serious challenges for speech processing in a wide variety of scenarios. Current energy-based and model-based approaches segment speech and non-speech classes directly, but are not robust to non-stationary noise. In this paper, we take a multi-source activity detection (MSAD) approach to SAD, estimating the activity levels of speech and of a set of non-speech acoustic sources. Public talks such as TED involve a wide variety of non-speech audio that is difficult to handle with standard SAD systems. We compare the proposed MSAD system against a tailored version of the popular SHOUT SAD system. We evaluate our approach on a subset of the TED data, with and without a sparsity constraint on the vector of acoustic source activities, to show the effectiveness of the technique.
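The abstract does not spell out how the vector of acoustic source activities is obtained; as a minimal illustrative sketch (not the paper's actual formulation), one common way to estimate non-negative source activities with an optional sparsity penalty is NMF-style multiplicative updates over a dictionary of source basis spectra. The dictionary `D`, the penalty weight `l1`, and the toy data below are all assumptions for illustration.

```python
import numpy as np

def source_activities(frame, dictionary, l1=0.0, n_iter=200):
    """Estimate non-negative activity levels of acoustic sources for one
    non-negative feature frame by approximately minimizing
    ||frame - dictionary @ a||^2 + l1 * ||a||_1 with NMF-style
    multiplicative updates. Illustrative sketch only, not the paper's
    exact method."""
    a = np.full(dictionary.shape[1], 1.0)
    for _ in range(n_iter):
        num = dictionary.T @ frame
        den = dictionary.T @ (dictionary @ a) + l1 + 1e-12
        a *= num / den  # keeps a >= 0 when frame and dictionary are >= 0
    return a

# Toy example: 3 hypothetical "source" basis spectra (e.g. speech, music,
# applause); the frame is dominated by source 0.
rng = np.random.default_rng(0)
D = np.abs(rng.normal(size=(8, 3)))
frame = 0.9 * D[:, 0] + 0.1 * D[:, 2]
act = source_activities(frame, D, l1=0.1)  # sparsity constraint on activities
```

A downstream SAD decision could then threshold the speech entry of `act` against the other source activities; the sparsity penalty `l1` encourages only a few sources to be active per frame.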
Bibliographic reference. Ferràs, Marc / Bourlard, Hervé (2014): "Multi-source posteriors for speech activity detection on public talks", In INTERSPEECH-2014, 2529-2532.