15th Annual Conference of the International Speech Communication Association

September 14-18, 2014

Limited Labels for Unlimited Data: Active Learning for Speaker Recognition

Stephen H. Shum, Najim Dehak, James R. Glass


In this paper, we attempt to quantify the amount of labeled data necessary to build a state-of-the-art speaker recognition system. We begin by using i-vectors and the cosine similarity metric to represent an unlabeled set of utterances, then obtain labels from a noiseless oracle in the form of pairwise queries. Finally, we use the resulting speaker clusters to train a PLDA scoring function, which is assessed on the 2010 NIST Speaker Recognition Evaluation. After presenting the initial results of an algorithm that sorts queries based on nearest-neighbor pairs, we develop techniques that further reduce the number of queries needed to obtain state-of-the-art performance. We illustrate, in anecdotal fashion, how well the methods generalize by applying them to two different distributions of utterances per speaker. Ultimately, we find that the number of pairwise labels needed to obtain state-of-the-art results may be a mere fraction of the queries required to fully label the entire set of utterances.
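The querying step described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the i-vectors are given as rows of a NumPy array with nonzero norm, ranks all utterance pairs by cosine similarity as a simple stand-in for the nearest-neighbor ordering, and skips a query only when earlier "same speaker" answers already place both utterances in one cluster. The names `cluster_by_pairwise_queries`, `oracle`, and `budget` are hypothetical, and PLDA training on the resulting clusters is not shown.

```python
import numpy as np


def cosine_similarity_matrix(X):
    """Pairwise cosine similarity between the rows of X (assumed nonzero)."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    return Xn @ Xn.T


def _find(parent, i):
    """Union-find root lookup with path halving."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def cluster_by_pairwise_queries(X, oracle, budget):
    """Ask the oracle about the most similar pairs first.

    oracle(i, j) returns True when utterances i and j share a speaker.
    Returns (cluster labels, number of queries actually spent).
    """
    n = X.shape[0]
    S = cosine_similarity_matrix(X)
    rows, cols = np.triu_indices(n, k=1)            # every unordered pair once
    order = np.argsort(-S[rows, cols], kind="stable")  # most similar first

    parent = list(range(n))
    used = 0
    for idx in order:
        if used >= budget:
            break
        i, j = int(rows[idx]), int(cols[idx])
        ri, rj = _find(parent, i), _find(parent, j)
        if ri == rj:
            continue  # answer already implied by earlier positive answers
        used += 1
        if oracle(i, j):
            parent[rj] = ri  # merge the two speaker clusters

    labels = [_find(parent, i) for i in range(n)]
    return labels, used
```

With a noiseless oracle and a budget of n(n-1)/2 queries, this sketch recovers the true speaker partition. Stopping earlier leaves a partial clustering whose merges are all correct, which is the kind of limited-label regime the abstract considers.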


Bibliographic reference.  Shum, Stephen H. / Dehak, Najim / Glass, James R. (2014): "Limited labels for unlimited data: active learning for speaker recognition", In INTERSPEECH-2014, 383-387.