Features derived from an auditory spectro-temporal representation of speech are proposed for robust far-field speaker identification. The auditory representation is obtained by first filtering the speech signal with a gammatone filterbank. A modulation filterbank is then applied to the temporal envelope of each gammatone filter output. Compared to commonly used mel-frequency cepstral coefficients (MFCC), the proposed features are shown to be more robust to mismatched conditions between enrollment and test data and less sensitive to increasing reverberation time (RT). Experiments with simulated and recorded far-field speech show that a Gaussian mixture model (GMM) based identification system, trained on the proposed features, attains an average improvement in identification accuracy of 15% relative to a system trained on MFCC. Improvements of up to 85% are attained for longer RTs.
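The two-stage analysis described above (gammatone filterbank, then a modulation filterbank applied to each channel's temporal envelope) can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: the center frequencies, the number of channels, the modulation band edges, and the use of a Hilbert-transform envelope and Butterworth modulation filters are all assumptions chosen for clarity.

```python
import numpy as np
from scipy.signal import gammatone, lfilter, hilbert, butter

fs = 16000  # sampling rate in Hz (assumed)
t = np.arange(0, 0.5, 1 / fs)
# Synthetic test signal standing in for a speech frame: two tones
x = np.sin(2 * np.pi * 500 * t) + 0.5 * np.sin(2 * np.pi * 1500 * t)

# Stage 1: gammatone filterbank center frequencies (illustrative log spacing;
# an ERB-rate spacing would be typical in practice)
center_freqs = np.geomspace(100, 4000, 8)
# Stage 2: modulation filterbank band edges in Hz (illustrative choices)
modulation_bands = [(2, 4), (4, 8), (8, 16)]

features = []
for fc in center_freqs:
    b, a = gammatone(fc, 'iir', fs=fs)   # 4th-order IIR gammatone filter
    y = lfilter(b, a, x)                 # cochlear channel output
    env = np.abs(hilbert(y))             # temporal envelope via Hilbert transform
    for f_lo, f_hi in modulation_bands:
        bb, ba = butter(2, [f_lo, f_hi], btype='band', fs=fs)
        m = lfilter(bb, ba, env)         # envelope restricted to one modulation band
        # One energy feature per (acoustic band, modulation band) pair
        features.append(np.sqrt(np.mean(m ** 2)))

features = np.array(features)
print(features.shape)  # 8 acoustic bands x 3 modulation bands = 24 features
```

In a full system, such per-band modulation energies would be computed frame by frame and modeled with a GMM per speaker; the claimed robustness comes from low-frequency envelope modulations being less disturbed by reverberant smearing than short-term spectral detail.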
Bibliographic reference. Falk, Tiago H. / Chan, Wai-Yip (2008): "Spectro-temporal features for robust far-field speaker identification", In INTERSPEECH-2008, 634-637.