10th Annual Conference of the International Speech Communication Association

Brighton, United Kingdom
September 6-10, 2009

Spectral and Temporal Modulation Features for Phonetic Recognition

Stephen A. Zahorian, Hongbing Hu, Zhengqing Chen, Jiang Wu

Binghamton University, USA

Recently, the modulation spectrum has been proposed and found to be a useful source of speech information. The modulation spectrum represents longer-term variations in the spectrum and thus implicitly requires features extracted from much longer speech segments than those used for MFCCs and their delta terms. In this paper, a Discrete Cosine Transform (DCT) analysis of the log magnitude spectrum, combined with a Discrete Cosine Series (DCS) expansion of the DCT coefficients over time, is proposed as a method for capturing both the spectral and modulation information. These DCT/DCS features can be computed so as to emphasize frequency resolution, time resolution, or a combination of the two. Several variations of the DCT/DCS features were evaluated in phonetic recognition experiments using TIMIT and its telephone version (NTIMIT). The best results obtained with a combined feature set are 73.85% for TIMIT and 62.5% for NTIMIT. The modulation features are shown to be far more important than the spectral features for automatic speech recognition and far more robust to noise.
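
The two-stage analysis named in the abstract (a DCT across frequency, then a cosine series expansion of each DCT coefficient over time) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, frame length, hop size, block length, and numbers of terms are all hypothetical choices made for illustration, plain unwarped cosine basis vectors are assumed, and any frequency or time warping that the full paper may apply is omitted.

```python
import numpy as np


def cosine_basis(n_terms, n_points):
    """Return an (n_terms, n_points) matrix of unwarped cosine basis vectors.

    Row k is cos(pi * k * (n + 0.5) / n_points), i.e. an unnormalized DCT-II basis.
    """
    k = np.arange(n_terms)[:, None]
    n = np.arange(n_points)[None, :]
    return np.cos(np.pi * k * (n + 0.5) / n_points)


def dct_dcs_features(signal, frame_len=400, hop=160,
                     n_dct=13, n_dcs=4, block_len=25):
    """Sketch of DCT/DCS-style features (all parameter values are assumptions).

    1. Split the signal into overlapping Hamming-windowed frames.
    2. Take the log magnitude spectrum of each frame.
    3. Project each spectrum onto n_dct cosine basis vectors (spectral detail).
    4. Over each sliding block of block_len frames, project every DCT
       coefficient trajectory onto n_dcs cosine basis vectors (temporal detail).
    """
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)

    # Small constant avoids log(0) for silent frames.
    log_spec = np.log(np.abs(np.fft.rfft(frames, axis=1)) + 1e-10)

    # Stage 1: DCT across frequency -> (n_frames, n_dct)
    dct_coeffs = log_spec @ cosine_basis(n_dct, log_spec.shape[1]).T

    # Stage 2: cosine series over time within each block -> (n_dcs, n_dct) per block
    time_basis = cosine_basis(n_dcs, block_len)
    n_blocks = n_frames - block_len + 1
    feats = np.stack([
        (time_basis @ dct_coeffs[b:b + block_len]).ravel()
        for b in range(n_blocks)
    ])
    return feats  # shape: (n_blocks, n_dcs * n_dct)
```

In this sketch, raising `n_dct` while keeping `n_dcs` small favors frequency detail, and raising `n_dcs` (or shortening `block_len`) favors temporal detail, which is one way to read the trade-off between frequency resolution and time resolution mentioned in the abstract.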


Bibliographic reference. Zahorian, Stephen A. / Hu, Hongbing / Chen, Zhengqing / Wu, Jiang (2009): "Spectral and temporal modulation features for phonetic recognition", in INTERSPEECH-2009, 1071-1074.