In this paper, we describe experiments on automatic emotion recognition using comparable speech corpora collected from real-life American English and German Interactive Voice Response (IVR) systems. We compute the optimal set of acoustic and prosodic features for mono-, cross-, and multi-lingual anger recognition, and analyze the differences. When an emotion recognition system is confronted with a language it has not been trained on, we normally observe severe system degradation. Analyzing this loss, we report on strategies for combining the feature spaces, both with and without combining and retraining the mono-lingual systems. We report classification scores and feature sets for various cases, and estimate the relative importance of features on both databases. We compare the feature distributions and feature ranks by evaluating the information gain ratio (IGR). After final system integration, we obtain a single bi-lingual anger recognition system which performs just as well as two separate mono-lingual systems on the test data.
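The abstract's feature comparison relies on ranking features by information gain ratio. The paper's actual feature set and tooling are not given here; as a minimal sketch under the standard definition, IGR of a discrete feature is the information gain on the class labels divided by the feature's own entropy. The feature names and toy data below are purely illustrative:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label sequence, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain_ratio(feature_values, class_labels):
    """Information gain ratio of a discrete feature w.r.t. class labels.

    IGR = (H(class) - H(class | feature)) / H(feature).
    Returns 0.0 for a constant feature (zero feature entropy).
    """
    n = len(class_labels)
    h_class = entropy(class_labels)
    # Conditional entropy: entropy of labels within each feature bin,
    # weighted by the bin's relative frequency.
    by_value = {}
    for v, c in zip(feature_values, class_labels):
        by_value.setdefault(v, []).append(c)
    h_cond = sum(len(group) / n * entropy(group) for group in by_value.values())
    h_feature = entropy(feature_values)
    return (h_class - h_cond) / h_feature if h_feature > 0 else 0.0

# Toy example: rank two hypothetical discretized prosodic features
# against anger/neutral labels (illustrative data, not from the paper).
labels = ["anger", "anger", "neutral", "neutral", "anger", "neutral"]
features = {
    "pitch_bin":  ["high", "high", "low", "low", "high", "low"],
    "energy_bin": ["mid", "high", "mid", "low", "mid", "mid"],
}
ranking = sorted(features, key=lambda f: info_gain_ratio(features[f], labels),
                 reverse=True)
print(ranking)  # pitch_bin separates the classes perfectly, so it ranks first
```

Continuous acoustic/prosodic features would have to be discretized (e.g. binned) before such a count-based IGR can be applied.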
Index Terms: emotion recognition, anger classification, IVR speech, IGR, acoustic/prosodic features, speech processing
Cite as: Polzehl, T., Schmitt, A., Metze, F. (2010) Approaching multi-lingual emotion recognition from speech - on language dependency of acoustic/prosodic features for anger recognition. Proc. Speech Prosody 2010, paper 442
@inproceedings{polzehl10_speechprosody,
  author={Tim Polzehl and Alexander Schmitt and Florian Metze},
  title={{Approaching multi-lingual emotion recognition from speech - on language dependency of acoustic/prosodic features for anger recognition}},
  year=2010,
  booktitle={Proc. Speech Prosody 2010},
  pages={paper 442}
}