Speech production variability due to whisper represents a major challenge for effective speech systems. Talkers use whisper intentionally in certain circumstances to protect personal privacy. Because whisper is produced without periodic excitation, the spectral structure of whispered speech differs considerably from that of neutral speech. As a result, the performance of speaker ID systems trained with high-energy voiced phonemes degrades significantly when tested with whisper. This study considers a combination of modified temporal patterns (m-TRAPs) and MFCCs to improve the performance of a neutral-trained system on whispered speech. The m-TRAPs are motivated by an explanation of the whisper/neutral mismatch degradation of an MFCC-based system. A phoneme-by-phoneme score weighting method is used to fuse the scores from each subband. Text-independent closed-set speaker ID experiments show that m-TRAPs are especially effective for whisper with low SNR. When scores from both MFCC- and TRAPs-based GMMs are combined, an absolute 26.3% improvement in accuracy is obtained compared with a traditional MFCC baseline system. This result confirms a viable approach to improving speaker ID performance under neutral/whisper mismatched conditions.
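The score fusion described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that each phoneme segment already has a per-speaker log-likelihood score from each stream (for example an MFCC-based GMM and an m-TRAPs-based GMM), and that a weight is available for every phoneme and stream. The names `fuse_scores`, `identify`, the stream labels, and all numeric values are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple

# One segment: (phoneme label, {stream name: {speaker: log-likelihood score}})
Segment = Tuple[str, Dict[str, Dict[str, float]]]


def fuse_scores(
    segments: List[Segment],
    weights: Dict[str, Dict[str, float]],
) -> Dict[str, float]:
    """Accumulate a weighted sum of stream scores per speaker.

    weights[phoneme][stream] is a hypothetical phoneme-dependent weight;
    how such weights are derived is not specified here.
    """
    fused: Dict[str, float] = {}
    for phoneme, stream_scores in segments:
        for stream, speaker_scores in stream_scores.items():
            w = weights[phoneme][stream]
            for speaker, score in speaker_scores.items():
                fused[speaker] = fused.get(speaker, 0.0) + w * score
    return fused


def identify(fused: Dict[str, float]) -> str:
    """Closed-set decision: pick the speaker with the highest fused score."""
    return max(fused, key=fused.get)


if __name__ == "__main__":
    # Illustrative values only, not data from the study.
    segments: List[Segment] = [
        ("s", {"mfcc": {"A": -12.0, "B": -10.0},
               "mtraps": {"A": -8.0, "B": -11.0}}),
        ("aa", {"mfcc": {"A": -9.0, "B": -8.0},
                "mtraps": {"A": -7.0, "B": -9.0}}),
    ]
    weights = {
        "s": {"mfcc": 0.25, "mtraps": 0.75},
        "aa": {"mfcc": 0.5, "mtraps": 0.5},
    }
    fused = fuse_scores(segments, weights)
    print(fused, identify(fused))
```

In this toy input the MFCC stream alone would favor speaker B, while the weighted combination favors speaker A, which shows how a second stream can change the closed-set decision.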
Bibliographic reference. Fan, Xing / Hansen, John H. L. (2009): "Speaker identification for whispered speech using modified temporal patterns and MFCCs", In INTERSPEECH-2009, 896-899.