Recently, the modulation spectrum has been proposed and found to be a useful source of speech information. The modulation spectrum represents longer-term variations in the spectrum and thus implicitly requires features extracted from much longer speech segments than those used for MFCCs and their delta terms. In this paper, a Discrete Cosine Transform (DCT) analysis of the log magnitude spectrum, combined with a Discrete Cosine Series (DCS) expansion of the DCT coefficients over time, is proposed as a method for capturing both the spectral and modulation information. These DCT/DCS features can be computed so as to emphasize frequency resolution, time resolution, or a combination of the two. Several variations of the DCT/DCS features were evaluated in phonetic recognition experiments using TIMIT and its telephone version (NTIMIT). The best results, obtained with a combined feature set, are 73.85% for TIMIT and 62.5% for NTIMIT. The modulation features are shown to be far more important than the spectral features for automatic speech recognition, and far more noise robust.
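The two-stage analysis described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, frame and block sizes, numbers of terms, the plain (unwarped) cosine basis, and the non-overlapping blocks are all assumptions chosen for illustration. It shows a DCT taken across frequency on each frame's log magnitude spectrum, followed by a cosine series expansion of each DCT coefficient's trajectory over a block of frames.

```python
import numpy as np


def cosine_basis(n_terms, n_points):
    """Return an (n_terms, n_points) matrix of cosine basis vectors.

    Row k is cos(pi * k * (n + 0.5) / n_points) for n = 0..n_points-1.
    This is an unwarped, unnormalized basis (an assumption for this sketch).
    """
    k = np.arange(n_terms)[:, None]
    n = np.arange(n_points)[None, :]
    return np.cos(np.pi * k * (n + 0.5) / n_points)


def dct_dcs_features(signal, frame_len=160, hop=80,
                     n_dct=13, n_dcs=5, block_frames=20):
    """Hypothetical DCT/DCS feature sketch; all parameter values are assumed.

    Returns an array of shape (n_blocks, n_dct * n_dcs).
    """
    signal = np.asarray(signal, dtype=float)

    # 1. Split the signal into overlapping, Hamming-windowed frames.
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)

    # 2. Log magnitude spectrum of each frame (small floor avoids log(0)).
    log_mag = np.log(np.abs(np.fft.rfft(frames, axis=1)) + 1e-10)

    # 3. DCT across frequency: spectral shape per frame, (n_frames, n_dct).
    freq_basis = cosine_basis(n_dct, log_mag.shape[1])
    dct_coeffs = log_mag @ freq_basis.T

    # 4. DCS across time: expand each DCT coefficient's trajectory within
    #    non-overlapping blocks of frames, capturing slower modulations.
    time_basis = cosine_basis(n_dcs, block_frames)
    n_blocks = n_frames // block_frames
    blocks = dct_coeffs[:n_blocks * block_frames]
    blocks = blocks.reshape(n_blocks, block_frames, n_dct)
    feats = np.einsum("kt,btc->bck", time_basis, blocks)

    return feats.reshape(n_blocks, n_dct * n_dcs)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal(16000)  # synthetic stand-in for a speech signal
    print(dct_dcs_features(x).shape)
```

In this sketch, raising `n_dct` while keeping `block_frames` short favors frequency resolution, whereas a longer `block_frames` with more `n_dcs` terms favors the temporal (modulation) detail; this is one possible reading of the trade-off mentioned above, not a description of the configurations evaluated in the paper.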
Bibliographic reference. Zahorian, Stephen A. / Hu, Hongbing / Chen, Zhengqing / Wu, Jiang (2009): "Spectral and temporal modulation features for phonetic recognition", In INTERSPEECH-2009, 1071-1074.