We recently proposed a family of robust linear and nonlinear estimation techniques for recognizing the three emotion primitives (valence, activation, and dominance) from speech. These techniques used both local and global speech features drawn from four categories: duration, energy, MFCC, and pitch. This paper studies the relative importance of these four categories of acoustic features for emotion primitive estimation. Three measures are considered: the number of features selected from each category when all features are available for selection, the mean absolute error (MAE) when each category is used on its own, and the MAE when a category is excluded from feature selection. We find that the relative importance follows the order MFCC > Energy = Pitch > Duration. Additionally, estimator fusion almost always improves performance, and locally weighted fusion always outperforms average fusion regardless of the number of features used.
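The second and third measures can be illustrated with a minimal sketch. It assumes a plain least-squares linear estimator as a stand-in for the paper's estimators and skips the feature selection step entirely; the function names (`mean_absolute_error`, `fit_predict_linear`, `category_mae`) and the `categories` mapping from category name to feature column indices are hypothetical and not taken from the paper.

```python
import numpy as np


def mean_absolute_error(y_true, y_pred):
    """MAE = mean of |y_true - y_pred| over all samples."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))


def fit_predict_linear(X_train, y_train, X_test):
    """Stand-in estimator (assumption): ordinary least squares with a bias term."""
    A = np.hstack([X_train, np.ones((X_train.shape[0], 1))])
    w, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    B = np.hstack([X_test, np.ones((X_test.shape[0], 1))])
    return B @ w


def category_mae(X_train, y_train, X_test, y_test, categories):
    """For each feature category, report the MAE when it is used alone
    and the MAE when it is excluded and all other categories are used.

    categories: dict mapping a category name to a list of column indices.
    """
    all_cols = sorted(c for cols in categories.values() for c in cols)
    results = {}
    for name, cols in categories.items():
        rest = [c for c in all_cols if c not in cols]
        pred_alone = fit_predict_linear(X_train[:, cols], y_train, X_test[:, cols])
        pred_excl = fit_predict_linear(X_train[:, rest], y_train, X_test[:, rest])
        results[name] = {
            "alone": mean_absolute_error(y_test, pred_alone),
            "excluded": mean_absolute_error(y_test, pred_excl),
        }
    return results
```

Under this reading, a category matters more when its "alone" MAE is low and its "excluded" MAE is high relative to the other categories.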
Bibliographic reference. Wu, Dongrui / Parsons, Thomas D. / Narayanan, Shrikanth S. (2010): "Acoustic feature analysis in speech emotion primitives estimation", In INTERSPEECH-2010, 785-788.