ISCA Archive Interspeech 2015

A comparison of neural network feature transforms for speaker diarization

Sree Harsha Yella, Andreas Stolcke

Speaker diarization finds contiguous speaker segments in an audio stream and clusters them by speaker identity, without using a priori knowledge about the number of speakers or enrollment data. Diarization typically clusters speech segments based on short-term spectral features. In prior work, we showed that neural networks can serve as discriminative feature transformers for diarization by training them to perform same/different speaker comparisons on speech segments, yielding improved diarization accuracy when combined with standard MFCC-based models. In this work, we explore a wider range of neural network architectures for feature transformation, by adding layers and nonlinearities, and by varying the objective function during training. We find that the original speaker comparison network can be improved by adding a nonlinear transform layer, and that further gains are possible by training the network to perform speaker classification rather than comparison. Overall, we achieve relative reductions in speaker error between 18% and 34% on a variety of test data from the AMI, ICSI, and NIST-RT corpora.
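
The general idea of using a speaker-classification network as a feature transform can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give layer sizes, activation functions, or where the features are tapped, so the 19-dimensional input, the tanh hidden layer, the 8-dimensional linear layer used as the feature output, the 4 training speakers, and the class name `SpeakerClassifierSketch` are all assumptions made for illustration. The weights are random here; in practice they would be learned from labeled speech.

```python
import math
import random


def make_layer(n_in, n_out, rng):
    """Return (weights, biases) with small random weights (untrained placeholder)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def affine(layer, x):
    """Compute W x + b for one input vector x."""
    weights, biases = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


class SpeakerClassifierSketch:
    """Hypothetical network: input -> tanh hidden -> linear features -> speaker softmax."""

    def __init__(self, n_input=19, n_hidden=32, n_features=8, n_speakers=4, seed=0):
        # All sizes are assumed values, not taken from the paper.
        rng = random.Random(seed)
        self.hidden = make_layer(n_input, n_hidden, rng)
        self.features = make_layer(n_hidden, n_features, rng)
        self.output = make_layer(n_features, n_speakers, rng)

    def transform(self, frame):
        """Map one spectral feature frame to the transformed feature vector."""
        h = [math.tanh(v) for v in affine(self.hidden, frame)]
        return affine(self.features, h)

    def speaker_posteriors(self, frame):
        """Classification output; used as the training target, not at diarization time."""
        return softmax(affine(self.output, self.transform(frame)))
```

In this sketch, `transform` would be applied to each frame of a test recording, and the resulting vectors would be passed to the clustering stage, possibly alongside the original MFCC features. The softmax layer is only needed while training on the known speakers.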

doi: 10.21437/Interspeech.2015-101

Cite as: Yella, S.H., Stolcke, A. (2015) A comparison of neural network feature transforms for speaker diarization. Proc. Interspeech 2015, 3026-3030, doi: 10.21437/Interspeech.2015-101

  author={Sree Harsha Yella and Andreas Stolcke},
  title={{A comparison of neural network feature transforms for speaker diarization}},
  booktitle={Proc. Interspeech 2015},