Modifying the articulatory parameters to raise the prominence of a segment of an utterance (hyperarticulation) is usually accompanied by a reduction of these parameters (hypoarticulation) in the neighboring segments. In this paper we investigate different approaches to the automatic labeling of word prominence and, in particular, how the information contained in the word sequence can be used. During the recording of the underlying audio-visual database, the subjects were asked to correct the system's misunderstanding of a single word using prosodic cues only. We extracted an extensive range of features from the audio and visual channels. To classify word prominence we compare two algorithms: a support vector machine (SVM), which is a local classifier, and a linear-chain Conditional Random Field (CRF), which is a classifier based on a sequential model. Both were trained on different context regions. For the CRF, the whole sentence is used as a word sequence for training and testing. Overall we show that sequence models such as the CRF, which performs best in our experiment, are well suited to prominence detection and, furthermore, that the neighboring words contain information which further improves the detection.
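One way to picture a "context region" for a local classifier is to stack each word's feature vector with those of its neighbors. The following is only a minimal sketch of that idea, not the authors' method: the function name `add_context`, the window parameter `k`, the zero padding at sentence boundaries, and the two-dimensional example features are all assumptions made for illustration.

```python
from typing import List


def add_context(sentence: List[List[float]], k: int) -> List[List[float]]:
    """Concatenate each word's features with those of its k left and
    k right neighbors. Positions outside the sentence are zero-padded.
    (Hypothetical helper; the padding scheme is an assumption.)"""
    if not sentence:
        return []
    dim = len(sentence[0])
    pad = [0.0] * dim
    windowed = []
    for i in range(len(sentence)):
        row: List[float] = []
        for j in range(i - k, i + k + 1):
            # Use the neighbor's features if it exists, else the zero pad.
            row.extend(sentence[j] if 0 <= j < len(sentence) else pad)
        windowed.append(row)
    return windowed


# Hypothetical per-word features for a three-word sentence.
sentence = [[0.2, 0.1], [0.9, 0.8], [0.3, 0.2]]
windowed = add_context(sentence, k=1)
```

With `k=0` each word is described by its own features only; larger values of `k` give a local classifier access to the neighboring words, whereas a sequence model such as a CRF operates on the whole sentence.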
Bibliographic reference. Schnall, Andrea / Heckmann, Martin (2014): "Integrating sequence information in the audio-visual detection of word prominence in a human-machine interaction scenario", In INTERSPEECH-2014, 2640-2644.