Soft VAD in Factor Analysis Based Speaker Segmentation of Broadcast News

Brecht Desplanques, Kris Demuynck, Jean-Pierre Martens

In this work we propose to integrate a soft voice activity detection (VAD) module in an iVector-based speaker segmentation system. As speaker change detection should be based on speaker information only, we want it to disregard the non-speech frames by applying speech posteriors during the estimation of the Baum-Welch statistics. The speaker segmentation relies on speaker factors which are extracted on a frame-by-frame basis using an eigenvoice matrix. Speaker boundaries are inserted at positions where the distance between the speaker factors at both sides is large. A Mahalanobis distance seems capable of suppressing the effects of differences in the phonetic content at both sides, and therefore of generating more accurate speaker boundaries. This iVector-based segmentation significantly outperforms Bayesian Information Criterion (BIC) segmentation methods and can be made adaptive on a file-by-file basis in a two-pass approach. Experiments on the COST278 multilingual broadcast news database show that integrating the soft VAD significantly reduces the boundary detection error rate. Furthermore, the more accurate boundaries induce a slight improvement of the iVector Probabilistic Linear Discriminant Analysis (PLDA) system that is employed for speaker clustering.
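
As an illustration of applying speech posteriors during the estimation of the Baum-Welch statistics, the following minimal sketch weights each frame's contribution to the zeroth- and first-order statistics by its speech posterior. This is one plausible reading of the abstract, not the authors' implementation: the function name `soft_vad_stats`, the input layout, and the simple multiplicative weighting are all assumptions made for illustration.

```python
# Minimal sketch (assumed formulation, not taken from the paper):
#   N_c = sum_t p_speech(t) * gamma_t(c)
#   F_c = sum_t p_speech(t) * gamma_t(c) * x_t
# where gamma_t(c) is the posterior of background-model component c for
# frame t and p_speech(t) is the soft VAD output for that frame.

def soft_vad_stats(frames, comp_post, speech_post):
    """Return (N, F): zeroth- and first-order statistics per component,
    with every frame weighted by its speech posterior."""
    n_comp = len(comp_post[0])
    dim = len(frames[0])
    N = [0.0] * n_comp
    F = [[0.0] * dim for _ in range(n_comp)]
    for x, gamma, p in zip(frames, comp_post, speech_post):
        for c in range(n_comp):
            w = p * gamma[c]          # soft-VAD-weighted occupation
            N[c] += w
            for d in range(dim):
                F[c][d] += w * x[d]
    return N, F


# Toy data (hypothetical): three 2-dimensional frames, two components.
frames = [[1.0, 2.0], [3.0, 4.0], [10.0, 10.0]]
comp_post = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
speech_post = [1.0, 1.0, 0.0]         # the last frame is judged non-speech

N, F = soft_vad_stats(frames, comp_post, speech_post)
print(N)  # [1.5, 0.5]
print(F)  # [[2.5, 4.0], [1.5, 2.0]]
```

In this toy example the third frame has a speech posterior of 0 and therefore contributes nothing to either statistic, while a hard VAD would correspond to restricting `speech_post` to the values 0 and 1.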

DOI: 10.21437/Odyssey.2016-23

Cite as

Desplanques, B., Demuynck, K., Martens, J. (2016) Soft VAD in Factor Analysis Based Speaker Segmentation of Broadcast News. Proc. Odyssey 2016, 158-165.

author={Brecht Desplanques and Kris Demuynck and Jean-Pierre Martens},
title={Soft VAD in Factor Analysis Based Speaker Segmentation of Broadcast News},
booktitle={Odyssey 2016},