In this work we propose to integrate a soft voice activity detection (VAD) module into an iVector-based speaker segmentation system. As speaker change detection should be based on speaker information only, we want it to disregard the non-speech frames by applying speech posteriors during the estimation of the Baum-Welch statistics. The speaker segmentation relies on speaker factors which are extracted on a frame-by-frame basis using an eigenvoice matrix. Speaker boundaries are inserted at positions where the distance between the speaker factors on the two sides is large. A Mahalanobis distance appears capable of suppressing the effects of differences in phonetic content between the two sides, and therefore of generating more accurate speaker boundaries. This iVector-based segmentation significantly outperforms Bayesian Information Criterion (BIC) segmentation methods and can be made adaptive on a file-by-file basis in a two-pass approach. Experiments on the COST278 multilingual broadcast news database show that integrating the soft VAD significantly reduces the boundary detection error rate. Furthermore, the more accurate boundaries induce a slight improvement in the iVector Probabilistic Linear Discriminant Analysis (PLDA) system that is employed for speaker clustering.
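The two mechanisms named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that "applying speech posteriors" means scaling each frame's Gaussian occupation probabilities by its speech posterior before accumulating zeroth- and first-order Baum-Welch statistics, and that the Mahalanobis distance is computed with a given inverse covariance (precision) matrix. The function names `soft_vad_bw_stats` and `mahalanobis_distance`, and all argument names, are hypothetical.

```python
import math


def soft_vad_bw_stats(frames, component_posteriors, speech_posteriors):
    """Accumulate Baum-Welch statistics weighted by soft VAD posteriors.

    frames: list of T feature vectors (each a list of D floats)
    component_posteriors: list of T lists with C occupation probabilities
    speech_posteriors: list of T speech probabilities in [0, 1] (assumed
        to come from a separate soft VAD module)
    Returns (zeroth, first): C occupation counts and C x D weighted sums.
    """
    num_components = len(component_posteriors[0])
    dim = len(frames[0])
    zeroth = [0.0] * num_components
    first = [[0.0] * dim for _ in range(num_components)]
    for x, gammas, p_speech in zip(frames, component_posteriors, speech_posteriors):
        for c, gamma in enumerate(gammas):
            weight = p_speech * gamma  # non-speech frames contribute ~0
            zeroth[c] += weight
            for d in range(dim):
                first[c][d] += weight * x[d]
    return zeroth, first


def mahalanobis_distance(left, right, precision):
    """Mahalanobis distance between two speaker-factor vectors.

    precision: inverse covariance matrix as a list of rows (assumed given;
        how it is estimated is not specified in the abstract).
    """
    diff = [a - b for a, b in zip(left, right)]
    squared = sum(
        diff[i] * precision[i][j] * diff[j]
        for i in range(len(diff))
        for j in range(len(diff))
    )
    return math.sqrt(squared)
```

With a speech posterior of 0 a frame drops out of the statistics entirely, and with a posterior of 1 the accumulation reduces to the usual unweighted form. A boundary detector in this spirit would compare speaker factors from the left and right of a candidate position and mark a boundary where the distance is large.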
Cite as: Desplanques, B., Demuynck, K., Martens, J.-P. (2016) Soft VAD in Factor Analysis Based Speaker Segmentation of Broadcast News. Proc. The Speaker and Language Recognition Workshop (Odyssey 2016), 158-165, doi: 10.21437/Odyssey.2016-23
@inproceedings{desplanques16_odyssey, author={Brecht Desplanques and Kris Demuynck and Jean-Pierre Martens}, title={{Soft VAD in Factor Analysis Based Speaker Segmentation of Broadcast News}}, year=2016, booktitle={Proc. The Speaker and Language Recognition Workshop (Odyssey 2016)}, pages={158--165}, doi={10.21437/Odyssey.2016-23} }