Speaker Diarization Using Convolutional Neural Network for Statistics Accumulation Refinement

Zbyněk Zajíc, Marek Hrúz, Luděk Müller


This paper investigates the benefit of information from a speaker change detection system based on a Convolutional Neural Network (CNN) when applied to the accumulation of statistics for i-vector generation. The investigation is carried out on the problem of speaker diarization. In our system, the output of the CNN is the probability of a speaker change in a conversation for a given time segment. According to this probability, we cut the conversation into short segments, each of which is then represented by an i-vector that describes the speaker in it. We propose a technique that uses the information from the CNN to weight the acoustic data in a segment and so refine the statistics accumulation process. This technique enables us to represent the speaker better in the final i-vector. Experiments on the English part of the CallHome corpus show that the proposed refinement of the statistics accumulation is beneficial, with a relative improvement in Diarization Error Rate of almost 16% compared to the speaker diarization system without statistics refinement.
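
To illustrate the idea of weighting acoustic data during statistics accumulation, the following is a minimal sketch and not the authors' implementation. It assumes that each frame t has a speaker-change probability p_t derived from the CNN output and that the frame weight is w_t = 1 - p_t; the abstract does not specify the weighting function, so this choice is purely an assumption. The function name and argument names are hypothetical. The sketch accumulates the weighted zeroth- and first-order statistics per mixture component that an i-vector extractor would typically consume.

```python
def accumulate_weighted_stats(frames, posteriors, change_probs):
    """Accumulate zeroth- and first-order statistics with per-frame weights.

    frames:       list of T feature vectors (each a list of D floats)
    posteriors:   list of T lists; posteriors[t][c] is the occupation
                  probability of mixture component c for frame t
    change_probs: list of T speaker-change probabilities in [0, 1]
                  (assumed here to come from the CNN output)
    """
    num_components = len(posteriors[0])
    dim = len(frames[0])
    zeroth = [0.0] * num_components
    first = [[0.0] * dim for _ in range(num_components)]

    for x, gamma, p in zip(frames, posteriors, change_probs):
        # ASSUMPTION: frames likely to lie near a speaker change get less weight.
        w = 1.0 - p
        for c in range(num_components):
            wg = w * gamma[c]
            zeroth[c] += wg
            for d in range(dim):
                first[c][d] += wg * x[d]
    return zeroth, first


# Toy usage with two 2-dimensional frames and two mixture components.
frames = [[1.0, 2.0], [3.0, 4.0]]
posteriors = [[1.0, 0.0], [0.5, 0.5]]
change_probs = [0.0, 0.5]
zeroth, first = accumulate_weighted_stats(frames, posteriors, change_probs)
```

With all change probabilities set to zero, every weight equals one and the sketch reduces to ordinary unweighted accumulation, which corresponds to the baseline without refinement.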


DOI: 10.21437/Interspeech.2017-51

Cite as: Zajíc, Z., Hrúz, M., Müller, L. (2017) Speaker Diarization Using Convolutional Neural Network for Statistics Accumulation Refinement. Proc. Interspeech 2017, 3562-3566, DOI: 10.21437/Interspeech.2017-51.


@inproceedings{Zajíc2017,
  author={Zbyněk Zajíc and Marek Hrúz and Luděk Müller},
  title={Speaker Diarization Using Convolutional Neural Network for Statistics Accumulation Refinement},
  year=2017,
  booktitle={Proc. Interspeech 2017},
  pages={3562--3566},
  doi={10.21437/Interspeech.2017-51},
  url={http://dx.doi.org/10.21437/Interspeech.2017-51}
}