12th Annual Conference of the International Speech Communication Association

Florence, Italy
August 27-31, 2011

Speaker Identification for Whispered Speech Using a Training Feature Transformation from Neutral to Whisper

Xing Fan, John H. L. Hansen

University of Texas at Dallas, USA

A number of research studies in speaker recognition have recently focused on robustness to microphone and channel mismatch (e.g., NIST SRE). However, changes in vocal effort, especially whispered speech, present significant challenges in maintaining system performance. Because the different production mechanisms of neutral and whispered speech yield mismatched spectral structure, the performance of speaker identification systems trained with neutral speech degrades significantly when tested with whispered speech. This study considers a feature transformation method in the training phase that leads to a more robust speaker model for speaker ID with whispered speech. In the proposed system, a Speech Mode Independent (SMI) Universal Background Model (UBM) is built using collected real neutral features and pseudo-whispered features generated either with Vector Taylor Series (VTS) or via Constrained Maximum Likelihood Linear Regression (CMLLR) model adaptation. Text-independent closed-set speaker ID results using the UT-VocalEffort II corpus show an accuracy of 88.87% using the proposed method, a 46.26% relative reduction in error rate compared with the baseline system's 79.29% accuracy. This result confirms a viable approach to improving speaker ID performance under mismatched neutral and whispered speech conditions.
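
The 46.26% relative figure quoted above is consistent with a relative reduction in error rate (100% minus accuracy) rather than a relative gain in accuracy. The following minimal Python sketch checks that arithmetic from the two accuracies reported in the abstract; the function name `relative_error_reduction` is hypothetical and introduced here only for illustration, not taken from the paper.

```python
def relative_error_reduction(baseline_acc: float, proposed_acc: float) -> float:
    """Relative reduction in error rate (percent), given accuracies in percent.

    Illustrative helper (an assumption, not from the paper): error rate is
    taken to be 100 minus the closed-set identification accuracy.
    """
    baseline_err = 100.0 - baseline_acc
    proposed_err = 100.0 - proposed_acc
    return 100.0 * (baseline_err - proposed_err) / baseline_err


# Accuracies reported in the abstract (percent).
baseline_accuracy = 79.29
proposed_accuracy = 88.87

reduction = relative_error_reduction(baseline_accuracy, proposed_accuracy)
print(f"{reduction:.2f}%")  # prints "46.26%"
```

Under this reading, the error rate falls from 20.71% to 11.13%. Measured as a relative gain in accuracy instead, the same two numbers would give roughly 12.08%, so the error-based interpretation is the one that matches the reported value.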

Bibliographic reference.  Fan, Xing / Hansen, John H. L. (2011): "Speaker identification for whispered speech using a training feature transformation from neutral to whisper", In INTERSPEECH-2011, 2425-2428.