In this paper we present results for the audio-visual discrimination of prominent from non-prominent words on a dataset with 16 speakers and more than 5000 utterances. We collected the data in a Wizard-of-Oz experiment in which users interacted with a computer via speech in a small game. Following a misunderstanding of a single word by the system, users were instructed to correct this word using prosodic cues only. Hence we obtain a dataset which contains the same word produced with normal and with high prominence. We extract an extensive range of features from the acoustic and the visual channel, and in doing so also introduce fundamental frequency curvature as a measure. The analysis shows large variation from speaker to speaker, both in the discrimination accuracy between prominent and non-prominent words and in which features yield the best results. In particular, we show that the visual channel is very informative for many of the speakers and that, overall, the feature capturing the mouth shape is the best individual feature. Furthermore, we show that a combination of the acoustic and visual features improves the performance for many of the speakers.
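The fundamental frequency curvature measure can be illustrated with a minimal sketch. This is not the paper's implementation; the function name, the quadratic-fit definition of curvature (twice the leading coefficient of a second-order polynomial fitted to the voiced part of the F0 contour), and the zero-as-unvoiced convention are assumptions for illustration only.

```python
import numpy as np

def f0_curvature(f0, voiced_mask=None):
    """Quadratic-fit curvature of an F0 contour (Hz per frame^2).

    Fits f0[t] ~ a*t^2 + b*t + c over voiced frames and returns 2*a,
    the second derivative of the fitted parabola. Assumed definition,
    not taken from the paper.
    """
    f0 = np.asarray(f0, dtype=float)
    if voiced_mask is None:
        # Common convention: frames with F0 == 0 are unvoiced.
        voiced_mask = f0 > 0
    t = np.arange(len(f0))[voiced_mask]
    y = f0[voiced_mask]
    if len(y) < 3:
        return 0.0  # too few voiced frames to fit a parabola
    a, _, _ = np.polyfit(t, y, deg=2)
    return 2.0 * a
```

Under this definition, a rising-falling (convex-down) contour, as often found on prominent syllables, yields a negative curvature, while a flat contour yields a value near zero.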
Bibliographic reference. Heckmann, Martin (2013): "Inter-speaker variability in audio-visual classification of word prominence", In INTERSPEECH-2013, 1791-1795.