Interspeech'2005 - Eurospeech
Many recent performance improvements in speaker recognition using higher-level features, as demonstrated in the NIST Speaker Recognition Evaluation (SRE) task, rely on combinations of multiple systems modeling a large variety of features. The diversity of the large set of features starting from short-term acoustic spectrum features all the way to habitual word usage from a large set of speakers in a multitude of settings (acoustic environment, speaking style, quantities of enrollment/test data) results in a challenging model combination task. In this work, we are presenting a classdependent score combination technique that relies on clustering of both the target models and the test utterances in a vector space defined by a set of speaker-specific transformation parameters estimated during transcription of the talker's speech by automatic speech recognition (ASR). We show that significant performance gains are obtained by using the first few principal components of a model transform for clustering the speaker verification trials into classes for (target speaker, test utterance) pairs, and then training a separate combiner for each class. We report results on the NIST SRE 2004 and FISHER datasets.
Bibliographic reference. Ferrer, Luciana / Sönmez, Kemal / Kajarekar, Sachin (2005): "Class-dependent score combination for speaker recognition", In INTERSPEECH-2005, 2173-2176.