Additional sub-phrase-level information is believed to improve accuracy in speech emotion recognition systems. Yet automatic segmentation at word or syllable boundaries is a challenge in its own right. Furthermore, it remains to be clarified which timing level leads to optimal results. In this paper we therefore quantitatively discuss three approaches to segment-level features, based on 276 statistical high-level prosodic, articulatory, and speech quality features. Apart from the choice of the optimal segmentation scheme, the fusion of segments with respect to classification and the combination of diverse timing levels are also analyzed. Tests are carried out on the popular Berlin Database of Emotional Speech (EMO-DB). A significant improvement over existing works is reported for the combination of phrase-level features with relative time interval features.
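As a rough illustration of combining phrase-level features with relative time interval features, the following minimal sketch splits one contour into segments of equal relative length and concatenates per-segment statistics with whole-phrase statistics. It is not the authors' implementation: the four statistics stand in for the 276 features mentioned above, the choice of three segments is arbitrary, and the function names and the pitch values are hypothetical.

```python
from statistics import mean, pstdev


def functionals(values):
    """Return a few simple statistics of a contour (placeholder feature set)."""
    return [mean(values), pstdev(values), min(values), max(values)]


def relative_segments(contour, n_segments):
    """Split a contour into n_segments parts of (nearly) equal relative length."""
    n = len(contour)
    if n < n_segments:
        raise ValueError("contour is shorter than the number of segments")
    # Integer boundaries at fixed relative positions of the phrase.
    bounds = [i * n // n_segments for i in range(n_segments + 1)]
    return [contour[bounds[i]:bounds[i + 1]] for i in range(n_segments)]


def combined_features(contour, n_segments=3):
    """Concatenate phrase-level statistics with per-segment statistics."""
    features = functionals(contour)  # phrase level
    for segment in relative_segments(contour, n_segments):
        features.extend(functionals(segment))  # segment level
    return features


# Hypothetical pitch contour (Hz) for a single phrase.
pitch = [110.0, 120.0, 130.0, 150.0, 170.0, 160.0, 140.0, 120.0, 100.0]
vector = combined_features(pitch, n_segments=3)
```

Because the segment boundaries are defined relative to the phrase length, every phrase yields a feature vector of the same dimension regardless of its duration, and no word or syllable boundary detection is required.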
Cite as: Schuller, B., Rigoll, G. (2006) Timing levels in segment-based speech emotion recognition. Proc. Interspeech 2006, paper 1695-Wed2BuP.8, doi: 10.21437/Interspeech.2006-502
@inproceedings{schuller06b_interspeech,
  author={Björn Schuller and Gerhard Rigoll},
  title={{Timing levels in segment-based speech emotion recognition}},
  year=2006,
  booktitle={Proc. Interspeech 2006},
  pages={paper 1695-Wed2BuP.8},
  doi={10.21437/Interspeech.2006-502}
}