It is widely accepted that affective and emotion-related states are acoustically well modelled on a supra-segmental level. Nonetheless, successes are reported for frame-level processing, either by means of dynamic classification or by multi-instance learning techniques. In this work, a quantitative feature-type-wise comparison between frame-level and supra-segmental analysis is carried out for the recognition of interest in human conversational speech. To shed light on the respective differences, the same classifier, namely Support Vector Machines, is used in both cases: once by clustering a ‘bag of frames’ of unknown sequence length employing Multi-Instance Learning techniques, and once by applying statistical functionals to project the time series onto a static feature vector. The Audiovisual Interest Corpus of naturalistic interest serves as the database.
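The two input representations contrasted above can be illustrated with a minimal sketch. It is not the authors' implementation: the chosen functionals (mean, standard deviation, minimum, maximum), the three features per frame, the random data, and the helper names `apply_functionals` and `as_bag_of_frames` are all assumptions made for illustration.

```python
import numpy as np


def apply_functionals(frames: np.ndarray) -> np.ndarray:
    """Project a (num_frames, num_features) time series onto a static vector.

    The four functionals used here are illustrative assumptions,
    not the feature set of the paper.
    """
    stats = [
        frames.mean(axis=0),
        frames.std(axis=0),
        frames.min(axis=0),
        frames.max(axis=0),
    ]
    return np.concatenate(stats)


def as_bag_of_frames(frames: np.ndarray, label: int) -> dict:
    """Keep every frame as an instance; only the bag as a whole is labelled."""
    return {"instances": frames, "label": label}


# Two hypothetical utterances of different length, 3 features per frame.
rng = np.random.default_rng(0)
short_utt = rng.normal(size=(50, 3))
long_utt = rng.normal(size=(200, 3))

# Supra-segmental view: both utterances map to vectors of the same length.
static_short = apply_functionals(short_utt)
static_long = apply_functionals(long_utt)

# Frame-level view: the bag keeps all frames, so its size varies per utterance.
bag = as_bag_of_frames(long_utt, label=1)
```

In this sketch the static vectors have the same dimensionality regardless of utterance length, so they could be passed directly to a standard classifier, whereas the bag retains a variable number of instances and would need a multi-instance learning scheme on top.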
Bibliographic reference. Schuller, Björn / Rigoll, Gerhard (2009): "Recognising interest in conversational speech - comparing bag of frames and supra-segmental features", In INTERSPEECH-2009, 1999-2002.