In this study, the problem of in-set versus out-of-set speaker recognition with limited train/test data is addressed. Since enrollment data is so limited (5 sec), acoustic holes will exist in the speaker phoneme space covered by the training tokens, and these must be filled. To achieve this, a cohort speaker selection process is developed to find speakers that possess similar acoustic characteristics. When common sentences are available, the resulting GMMs are used to measure the speakers' acoustic similarity with the Kullback-Leibler (KL) distance. When no common sentence structure exists, likelihood ratio scores are employed to measure speaker similarity. Gaussian components corresponding to the acoustic holes are then harvested from the cohort model. A GMM constructed using a phone recognition simulator with 65% accuracy is compared with the GMM employing common utterances on the TIMIT corpus. Finally, the combination of Gaussian components corresponding to acoustic holes and the common acoustic space is leveraged to improve overall system performance. The proposed acoustic hole filling algorithm is evaluated using speech from the TIMIT and FISHER corpora, with the GMM-UBM as the baseline system. It is shown to improve performance over the baseline by 25% on TIMIT and 13% on FISHER. This advancement is a significant step forward for in-set/out-of-set speaker recognition with limited training (5 sec) and test (2.8 sec) material.
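As an illustration of the KL-based similarity idea in the abstract, the following is a minimal sketch, not the authors' procedure. It assumes diagonal-covariance Gaussian components, uses the closed-form KL divergence between two such Gaussians, and flags cohort components that lie far from every target-speaker component as candidates for filling acoustic holes. The function names, the symmetrized form of the distance, and the `threshold` value are all hypothetical choices made for this example.

```python
import math

def kl_diag_gauss(mu_p, var_p, mu_q, var_q):
    """Closed-form KL(p || q) for two diagonal-covariance Gaussians."""
    total = 0.0
    for mp, vp, mq, vq in zip(mu_p, var_p, mu_q, var_q):
        total += math.log(vq / vp) + (vp + (mp - mq) ** 2) / vq - 1.0
    return 0.5 * total

def symmetric_kl(p, q):
    """Symmetrized KL distance; p and q are (means, variances) pairs."""
    return kl_diag_gauss(*p, *q) + kl_diag_gauss(*q, *p)

def uncovered_components(target, cohort, threshold):
    """Cohort components farther than `threshold` from every target component.

    Assumption: such components approximate regions of acoustic space that
    the target speaker's short enrollment data did not cover.
    """
    return [c for c in cohort
            if min(symmetric_kl(c, t) for t in target) > threshold]

# Toy 2-D example: one target component, two cohort components.
target = [([0.0, 0.0], [1.0, 1.0])]
cohort = [([0.1, 0.0], [1.0, 1.0]),   # close to the target component
          ([5.0, 5.0], [1.0, 1.0])]   # far away: a candidate hole filler
holes = uncovered_components(target, cohort, threshold=2.0)
```

In this toy setup only the second cohort component is returned. A distance between full GMMs has no closed form, so a real system would need an approximation (for example, component matching or sampling) on top of this per-component distance.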
Bibliographic reference. Suh, Jun-Won / Angkititrakul, Pongtep / Hansen, John H. L. (2008): "Filling acoustic holes through leveraged uncorrelated GMMs for in-set/out-of-set speaker recognition", in INTERSPEECH-2008, 1905-1908.