INTERSPEECH 2014
15th Annual Conference of the International Speech Communication Association

Singapore
September 14-18, 2014

Integrating Sequence Information in the Audio-Visual Detection of Word Prominence in a Human-Machine Interaction Scenario

Andrea Schnall (1), Martin Heckmann (2)

(1) Technische Universität Darmstadt, Germany
(2) Honda Research Institute Europe, Germany

Modifying the articulatory parameters to raise the prominence of a segment of an utterance (hyperarticulation) is usually accompanied by a reduction of these parameters (hypoarticulation) for the neighboring segments. In this paper we investigate different approaches for the automatic labeling of word prominence. In particular, we investigate how information from the surrounding word sequence can be used. During the recording of the underlying audio-visual database, the subjects were asked to correct the system's misunderstanding of a single word using prosodic cues only. We extracted an extensive range of features from the audio and visual channels. For the classification of word prominence we compare two algorithms: a Support Vector Machine (SVM), a local classifier, and a classifier based on a sequential model, a linear-chain Conditional Random Field (CRF). Both were trained on different context regions; for the CRF, the whole sentence is used as a word sequence for training and testing. Overall we show that sequence models such as the CRF, which performs best in our experiments, are well suited for prominence detection and, furthermore, that the neighboring words contain information which further improves the detection.
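The following is a minimal sketch (not the authors' code) of the contrast the abstract describes: a local SVM that classifies each word in isolation versus a linear-chain CRF that labels the whole sentence jointly. It assumes the sklearn and sklearn-crfsuite packages; the feature names (f0_mean, duration) and toy data are placeholders standing in for the paper's much larger audio-visual feature set.

```python
# Sketch: local SVM vs. linear-chain CRF for word prominence labeling.
# Features and data are illustrative placeholders, not the paper's set.
import numpy as np
from sklearn.svm import SVC
import sklearn_crfsuite

# Toy data: each sentence is a sequence of per-word feature dicts,
# labeled "P" (prominent) or "N" (non-prominent).
sentences = [
    [{"f0_mean": 220.0, "duration": 0.41},
     {"f0_mean": 180.0, "duration": 0.22},
     {"f0_mean": 170.0, "duration": 0.20}],
    [{"f0_mean": 160.0, "duration": 0.18},
     {"f0_mean": 240.0, "duration": 0.45},
     {"f0_mean": 165.0, "duration": 0.21}],
]
labels = [["P", "N", "N"], ["N", "P", "N"]]

# Local classifier: the SVM sees each word independently.
X = np.array([[w["f0_mean"], w["duration"]] for s in sentences for w in s])
y = [lab for seq in labels for lab in seq]
svm = SVC(kernel="rbf").fit(X, y)

# Sequence model: the linear-chain CRF is trained on whole sentences,
# so label transitions between neighboring words inform the decision.
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=100)
crf.fit(sentences, labels)

print(svm.predict(X[:3]))          # independent word-by-word decisions
print(crf.predict(sentences[:1]))  # joint decision over the sentence
```

The key design difference is where context enters: the SVM can only see context if it is folded into each word's feature vector, whereas the CRF models the label sequence directly, which is one way to exploit the hyper-/hypoarticulation pattern of neighboring words.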


Bibliographic reference. Schnall, Andrea / Heckmann, Martin (2014): "Integrating sequence information in the audio-visual detection of word prominence in a human-machine interaction scenario", in INTERSPEECH-2014, 2640-2644.