Since emotional speech can be regarded as a variation on neutral (non-emotional) speech, a robust neutral speech model is expected to be useful for contrasting the different emotions expressed in speech. This study explores this idea by creating acoustic models trained on spectral features from the emotionally neutral TIMIT corpus. Performance is tested on two emotional speech databases: one recorded with a microphone (acted) and another recorded from a telephone application (spontaneous). Accuracies of up to 78% and 65% are achieved in binary and category emotion discrimination, respectively. Raw Mel filter bank (MFB) outputs were found to perform better than conventional MFCCs for both broad-band and telephone-band speech. These results suggest that well-trained neutral acoustic models can serve effectively as a front-end for emotion recognition and that, once trained on MFB features, they may work reasonably well regardless of channel characteristics.
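The distinction between the two feature sets compared above is that MFB features are the log mel filter-bank energies used directly, while MFCCs apply an additional DCT decorrelation step on top of them. The following is a minimal NumPy sketch of that relationship, not the paper's exact front-end; the filter-bank construction, frame length, and filter counts here are illustrative assumptions.

```python
import numpy as np

def hz_to_mel(f):
    """Convert frequency in Hz to the mel scale."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Convert mel-scale values back to Hz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular filters equally spaced on the mel scale (illustrative design)."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        for k in range(left, center):          # rising slope
            fb[i, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):         # falling slope
            fb[i, k] = (right - k) / max(right - center, 1)
    return fb

def mfb_and_mfcc(frame, sr, n_filters=26, n_ceps=13):
    """Return (MFB, MFCC) features for one pre-windowed speech frame."""
    n_fft = len(frame)
    power = np.abs(np.fft.rfft(frame)) ** 2 / n_fft   # power spectrum
    fb = mel_filterbank(n_filters, n_fft, sr)
    # MFB: log mel filter-bank energies, used directly as features.
    mfb = np.log(fb @ power + 1e-10)
    # MFCC: DCT-II of the log energies, keeping the first n_ceps coefficients.
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), n + 0.5) / n_filters)
    mfcc = dct @ mfb
    return mfb, mfcc

# Example: one 32 ms Hamming-windowed frame of a 440 Hz tone at 16 kHz.
sr = 16000
t = np.arange(512) / sr
frame = np.hamming(512) * np.sin(2 * np.pi * 440.0 * t)
mfb, mfcc = mfb_and_mfcc(frame, sr)
```

Because the DCT discards higher-order coefficients, MFCCs compress the spectral envelope; keeping the full MFB vector retains detail that, per the abstract's finding, helps the neutral models generalize across broad-band and telephone-band channels.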
Bibliographic reference. Busso, Carlos / Lee, Sungbok / Narayanan, Shrikanth S. (2007): "Using neutral speech models for emotional speech analysis", In INTERSPEECH-2007, 2225-2228.