It is not fully known how long it takes a human to reliably recognize emotion in speech from the beginning of a phrase. Yet many technical applications demand very quick system responses, e.g. to prepare different feedback alternatives before the end of a speaker turn in a dialog system. We therefore investigate this ‘gating paradigm’ on two spoken language resources, in both cross-corpus and combined-corpus settings, with a focus on valence: we determine how quickly a reliable estimate is obtainable and whether models trained on speech of matching length prevail. In addition, we analyze how individual feature groups, by type and derived functionals, respond, and find considerably different behavior. The language resources were chosen to cover both manually segmented and automatically segmented speech. As a result, one second of speech proves sufficient on the datasets considered.
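The gating paradigm evaluates recognition on successively longer prefixes ("gates") of each utterance, measured from its onset, to find the shortest duration at which performance stabilizes. The following is a minimal, hypothetical sketch of that evaluation loop on synthetic data; the toy feature functionals, the threshold classifier, and all signal parameters are illustrative assumptions, not the paper's actual features or classifiers.

```python
# Hypothetical sketch of the gating paradigm: classify successively longer
# prefixes of each utterance and track accuracy per gate length.
# All features, the classifier, and the synthetic data are placeholders.
import numpy as np

rng = np.random.default_rng(0)

def extract_features(signal):
    """Toy statistical functionals over the raw samples (stand-ins for
    real acoustic descriptors such as pitch or energy functionals)."""
    return np.array([signal.mean(), signal.std(), signal.min(), signal.max()])

def classify(features, threshold=0.0):
    """Toy binary valence classifier: positive if the mean exceeds a threshold."""
    return 1 if features[0] > threshold else 0

# Synthetic "utterances": positive-valence signals carry a small positive bias.
sample_rate = 100  # samples per second (placeholder rate)
utterances = []
for label in (0, 1) * 20:
    bias = 0.3 if label == 1 else -0.3
    signal = rng.normal(loc=bias, scale=1.0, size=3 * sample_rate)  # 3 s long
    utterances.append((signal, label))

# Evaluate at gates of 0.25 s ... 3.0 s from utterance onset.
accuracies = {}
for gate_s in (0.25, 0.5, 1.0, 2.0, 3.0):
    n = int(gate_s * sample_rate)
    correct = sum(
        classify(extract_features(sig[:n])) == lab for sig, lab in utterances
    )
    accuracies[gate_s] = correct / len(utterances)
    print(f"gate {gate_s:4.2f} s: accuracy {accuracies[gate_s]:.2f}")
```

In this toy setting, accuracy rises with gate length and saturates once the prefix is long enough for the feature estimates to stabilize, mirroring the paper's question of how early a reliable valence estimate becomes available.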
Bibliographic reference. Schuller, Björn / Devillers, Laurence (2010): "Incremental acoustic valence recognition: an inter-corpus perspective on features, matching, and performance in a gating paradigm", In INTERSPEECH-2010, 801-804.