Interspeech'2005 - Eurospeech
In this paper, we describe a series of perception studies on uni- and multimodal cues to end of utterance. Stimuli were fragments taken from a recorded interview session, consisting of the parts in which speakers provided answers. The answers varied in length and were presented without the interviewer's preceding question. Subjects had to predict when the speaker would finish his turn, based on video material and/or auditory material. The experiment consisted of three conditions: in one condition, the stimuli were presented as they were recorded (both audio and vision); in the two remaining conditions, stimuli were presented in only the auditory or only the visual channel. Results show that the audiovisual condition evoked the fastest reaction times and the visual condition the slowest. Arguably, cues from different modalities function as complementary sources, and their combination might thus improve prediction.
Bibliographic reference. Barkhuysen, Pashiera / Krahmer, Emiel / Swerts, Marc (2005): "Predicting end of utterance in multimodal and unimodal conditions", In INTERSPEECH-2005, 2417-2420.