In this work we propose to integrate a soft voice activity detection (VAD) module into an iVector-based speaker segmentation system. As speaker change detection should be based on speaker information only, we want it to disregard the non-speech frames by applying speech posteriors during the estimation of the Baum-Welch statistics. The speaker segmentation relies on speaker factors which are extracted on a frame-by-frame basis using an eigenvoice matrix. Speaker boundaries are inserted at positions where the distance between the speaker factors on both sides is large. A Mahalanobis distance seems capable of suppressing the effects of differences in the phonetic content on both sides, and therefore of generating more accurate speaker boundaries. This iVector-based segmentation significantly outperforms Bayesian Information Criterion (BIC) segmentation methods and can be made adaptive on a file-by-file basis in a two-pass approach. Experiments on the COST278 multilingual broadcast news database show that integrating the soft VAD significantly reduces the boundary detection error rate. Furthermore, the more accurate boundaries induce a slight improvement of the iVector Probabilistic Linear Discriminant Analysis system that is employed for speaker clustering.
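The two ideas in the abstract, weighting the Baum-Welch statistics by speech posteriors and comparing speaker factors with a Mahalanobis distance, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `soft_vad_stats` and `mahalanobis_diag` are hypothetical, the per-frame weighting (speech posterior times component occupation) is one plausible reading of "applying speech posteriors", and the diagonal covariance is a simplifying assumption made to keep the example free of external dependencies.

```python
import math


def soft_vad_stats(frames, occupations, speech_post):
    """Accumulate zeroth- and first-order statistics with soft VAD weights.

    frames:       T feature vectors, each a list of D floats
    occupations:  T lists, occupations[t][c] is the posterior of
                  mixture component c at frame t
    speech_post:  T values in [0, 1], the soft VAD speech posterior

    Assumption: each frame's occupation is scaled by its speech
    posterior, so non-speech frames contribute (almost) nothing.
    """
    n_comp = len(occupations[0])
    dim = len(frames[0])
    zeroth = [0.0] * n_comp
    first = [[0.0] * dim for _ in range(n_comp)]
    for x, gamma, p in zip(frames, occupations, speech_post):
        for c in range(n_comp):
            w = p * gamma[c]  # soft-VAD weighted occupation
            zeroth[c] += w
            for d in range(dim):
                first[c][d] += w * x[d]
    return zeroth, first


def mahalanobis_diag(a, b, variances):
    """Mahalanobis distance between two vectors, assuming a diagonal
    covariance matrix given by its per-dimension variances."""
    return math.sqrt(
        sum((ai - bi) ** 2 / v for ai, bi, v in zip(a, b, variances))
    )
```

In a segmentation setting, `mahalanobis_diag` would be evaluated between the speaker factors on the left and right of each candidate position, and a boundary hypothesized where the distance peaks; the peak-picking rule itself is not specified in the abstract and is left out here.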
Desplanques, B., Demuynck, K., Martens, J. (2016) Soft VAD in Factor Analysis Based Speaker Segmentation of Broadcast News. Proc. Odyssey 2016, 158-165.
@inproceedings{Desplanques+2016,
  author={Brecht Desplanques and Kris Demuynck and Jean-Pierre Martens},
  title={Soft VAD in Factor Analysis Based Speaker Segmentation of Broadcast News},
  year=2016,
  booktitle={Odyssey 2016},
  doi={10.21437/Odyssey.2016-23},
  url={http://dx.doi.org/10.21437/Odyssey.2016-23},
  pages={158--165}
}