This paper investigates the contribution of features that convey long-term spectro-temporal (ST) information for the purpose of automatic emotional speech classification. The ST representation is obtained by means of a modulation filterbank decomposition of the long-term temporal envelopes of the outputs of a gammatone filterbank. The two-dimensional discrete cosine transform (DCT) is used to reduce the dimensionality of the representation; candidate features are then derived from statistics computed from the DCT coefficients. Sequential forward feature selection is used to select the most salient features. Two types of experiments are described which use the Berlin emotional speech database to test the performance of the ST features alone and in combination with prosodic features. In a multi-class experiment, simulation results with a support vector classifier show that a 44% reduction in classification error is attained when prosodic features are combined with the proposed ST features. Additionally, in a one-against-all experiment, an average increase in F-score of 33% is attained when the proposed ST features are included.
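The front end described above (acoustic filterbank, temporal envelopes, modulation decomposition, 2-D DCT) can be sketched in Python. This is a hedged illustration, not the authors' implementation: a Butterworth band-pass bank stands in for the gammatone filterbank, an FFT-based modulation spectrum stands in for the modulation filterbank, and low-order DCT coefficients are returned directly rather than the paper's coefficient statistics. All filter counts and cutoffs are illustrative assumptions.

```python
import numpy as np
from scipy.signal import hilbert, butter, sosfilt
from scipy.fft import dctn

def st_features(x, fs, n_acoustic=8, n_mod=4, n_dct=(3, 3)):
    """Sketch of long-term spectro-temporal (ST) feature extraction.

    Stand-ins (assumptions, not the paper's method): Butterworth band-pass
    bank instead of gammatone filters; pooled FFT magnitude spectrum of the
    envelopes instead of a modulation filterbank.
    """
    # 1. Acoustic filterbank: log-spaced band-pass filters.
    edges = np.geomspace(100, fs / 2 * 0.9, n_acoustic + 1)
    envs = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
        band = sosfilt(sos, x)
        envs.append(np.abs(hilbert(band)))  # temporal envelope per band
    envs = np.array(envs)                   # (n_acoustic, n_samples)

    # 2. Modulation decomposition: magnitude spectrum of each envelope,
    #    pooled into n_mod log-spaced modulation bands.
    spec = np.abs(np.fft.rfft(envs, axis=1))
    idx = np.geomspace(1, spec.shape[1] - 1, n_mod + 1).astype(int)
    st = np.array([spec[:, a:b].mean(axis=1)
                   for a, b in zip(idx[:-1], idx[1:])]).T
    # st: (n_acoustic, n_mod) spectro-temporal representation

    # 3. 2-D DCT for dimensionality reduction; keep low-order coefficients.
    coeffs = dctn(st, norm="ortho")[: n_dct[0], : n_dct[1]]
    return coeffs.ravel()

fs = 16000
x = np.random.default_rng(0).standard_normal(fs)  # 1 s of noise as dummy speech
feats = st_features(x, fs)
print(feats.shape)  # (9,)
```

In the paper, features such as these would then be screened by sequential forward selection before being fed, alone or alongside prosodic features, to a support vector classifier.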
Bibliographic reference. Wu, Siqing / Falk, Tiago H. / Chan, Wai-Yip (2008): "Long-term spectro-temporal information for improved automatic speech emotion classification", In INTERSPEECH-2008, 638-641.