Subtle temporal and spectral differences between categorical realizations of para-linguistic phenomena (e.g., affective vocal expressions) are hard to capture and describe. In this paper we present a signal representation based on Time-Varying Constant-Q Cepstral Coefficients (TVCQCC) derived for this purpose. A method which utilizes the special properties of the constant-Q transform for mean F0 estimation and normalization is described. The coefficients are invariant to segment length, and as a special case, a representation for prosody is considered. We report speaker-independent SVM classification results on the Berlin EMO-DB and on two closed sets of basic (anger, disgust, fear, happiness, sadness, neutral) and social/interpersonal (affection, pride, shame) emotions recorded by forty professional actors from two English dialect areas. The accuracy for the Berlin EMO-DB is 71.2%, while the accuracy was 44.6% for the first set (basic emotions) and 31.7% for the second set (basic and social emotions). It was found that F0 normalization boosts performance and that a combined feature set performs best.
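The segment-length invariance mentioned above can be achieved by taking a 2D-DCT of the log-magnitude (constant-Q) spectrogram and keeping only a fixed low-order block of coefficients, so the feature dimension does not depend on the number of frames. The following is a minimal sketch of that idea; the function name `tv_cepstra` and the coefficient counts are illustrative assumptions, not taken from the paper:

```python
import numpy as np
from scipy.fft import dctn

def tv_cepstra(log_spec, n_freq_coef=12, n_time_coef=8):
    """Illustrative sketch: 2D-DCT of a log-magnitude (constant-Q)
    spectrogram, truncated to a fixed low-order block so the feature
    size is independent of segment length (number of frames)."""
    c = dctn(log_spec, type=2, norm='ortho')      # 2D-DCT over (freq, time)
    return c[:n_freq_coef, :n_time_coef].ravel()  # fixed-size vector

# Toy log-spectrograms of different lengths map to the same dimension.
rng = np.random.default_rng(0)
f_short = tv_cepstra(rng.standard_normal((48, 50)))   # 50 frames
f_long  = tv_cepstra(rng.standard_normal((48, 200)))  # 200 frames
print(f_short.shape, f_long.shape)
```

In practice the input would be a constant-Q log-spectrogram (e.g., F0-normalized as described in the paper) rather than random data; the truncation step is what makes segments of different durations comparable.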
Index Terms: Emotion Classification, Constant-Q, 2D-DCT, supra-segmental, mean pitch estimation, prosody
Cite as: Neiberg, D., Laukka, P., Ananthakrishnan, G. (2010) Classification of affective speech using normalized time-frequency cepstra. Proc. Speech Prosody 2010, paper 071
@inproceedings{neiberg10_speechprosody, author={D. Neiberg and P. Laukka and G. Ananthakrishnan}, title={{Classification of affective speech using normalized time-frequency cepstra}}, year=2010, booktitle={Proc. Speech Prosody 2010}, pages={paper 071} }