DNN-Based Speaker Clustering for Speaker Diarisation

Rosanna Milner, Thomas Hain

Speaker diarisation, the task of answering “who spoke when?”, is often considered to consist of three independent stages: speech activity detection, speaker segmentation and speaker clustering. These correspond to separating speech from non-speech, splitting the speech into speaker-homogeneous segments, and grouping together the segments that belong to the same speaker. This paper is concerned with speaker clustering, which is typically performed by bottom-up clustering using the Bayesian information criterion (BIC). We present a novel semi-supervised method of speaker clustering based on a deep neural network (DNN) model. A speaker separation DNN trained on independent data is used to iteratively relabel the test data set. This is achieved by reconfiguring the output layer and fine-tuning the network in each iteration. A stopping criterion that uses posteriors as confidence scores is investigated. Results are shown on a meeting task (RT07) for single distant microphones and compared with standard diarisation approaches. The new method achieves a diarisation error rate (DER) of 14.8%, compared to a baseline of 19.9%.
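The iterative relabelling idea can be illustrated with a toy sketch. It is not the authors' implementation: the DNN is replaced by a hypothetical stand-in (a softmax over negative squared distances to per-cluster centroids), "fine-tuning" is reduced to recomputing those centroids, and "reconfiguring the output layer" is reduced to dropping classes that no longer own any segment. The function names, the `confidence_threshold` parameter and the exact stopping rule are all assumptions made for illustration.

```python
import numpy as np

def posteriors(X, centroids):
    # Stand-in for DNN output posteriors (assumption): softmax over
    # negative squared Euclidean distance to each cluster centroid.
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    logits = -d2
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

def iterative_relabel(X, init_labels, confidence_threshold=0.9, max_iters=20):
    labels = np.asarray(init_labels).copy()
    confidence = 0.0
    for _ in range(max_iters):
        # "Reconfigure the output layer": keep only classes still in use.
        classes = np.unique(labels)
        # "Fine-tune": re-estimate the model from the current labels.
        centroids = np.stack([X[labels == c].mean(axis=0) for c in classes])
        post = posteriors(X, centroids)
        new_labels = classes[post.argmax(axis=1)]
        # Mean of the winning posterior, used as a confidence score.
        confidence = float(post.max(axis=1).mean())
        converged = np.array_equal(new_labels, labels)
        labels = new_labels
        # Hypothetical stopping rule: stable labels or high confidence.
        if converged or confidence >= confidence_threshold:
            break
    return labels, confidence

# Six toy "segments" from two well-separated speakers, with a noisy
# initial labelling that uses three clusters.
X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
labels, confidence = iterative_relabel(X, [0, 0, 1, 2, 2, 1])
print(labels, confidence)
```

In this toy run the mixed cluster loses both of its segments after relabelling, so it disappears from the set of active classes. A real system would operate on frame-level or segment-level acoustic features and retrain network weights rather than centroids.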

DOI: 10.21437/Interspeech.2016-126

Cite as

Milner, R., Hain, T. (2016) DNN-Based Speaker Clustering for Speaker Diarisation. Proc. Interspeech 2016, 2185-2189.

author={Rosanna Milner and Thomas Hain},
title={DNN-Based Speaker Clustering for Speaker Diarisation},
booktitle={Interspeech 2016},