Auditory-Visual Speech Processing 2007 (AVSP2007)
Kasteel Groenendaal, Hilvarenbeek, The Netherlands
In the German SmartWeb project, the user interacts with the web via a PDA in order to obtain information on, for example, points of interest. To overcome the tedious use of devices such as push-to-talk, while still being able to tell whether the user is addressing the system or talking to herself or to a third person, we developed a module that monitors speech and video in parallel. Our database (3.2 hours of speech, 2086 turns) was recorded in a real-life setting, indoors as well as outdoors, with unfavourable acoustic and lighting conditions. With acoustic features, we classify up to four different types of addressing (talking to the system: On-Talk; reading from the display: Read Off-Talk; paraphrasing information presented on the screen: Paraphrasing Off-Talk; talking to a third person or to oneself: Spontaneous Off-Talk). With a camera integrated into the PDA, we record the user's face and decide whether she is looking at the PDA or somewhere else. We use three different types of turn features based on classification scores of frame-based face detection and word-based analysis: 13 acoustic-prosodic features, 18 linguistic features, and 9 video features. The classification rate for acoustics alone is up to 62 % for the four-class problem, and up to 77 % for the most important two-class problem "is the user focussing on interaction with the system or not". For video alone, it is 45 % and 71 %, respectively. By combining the two modalities and additionally using linguistic information, classification performance for the two-class problem so far rises to 85 %.
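The combination of modalities described above can be illustrated with a minimal score-level (late) fusion sketch. All names, weights, and the weighted-sum fusion rule below are illustrative assumptions for exposition, not the authors' actual method:

```python
# Hypothetical sketch: fusing per-modality posteriors for the two-class
# decision "user is focussing on interaction with the system or not".
# The weights and threshold are made-up illustration values.

def fuse_on_talk_scores(p_acoustic, p_linguistic, p_video,
                        weights=(0.4, 0.3, 0.3), threshold=0.5):
    """Combine per-modality P(On-Talk | turn) estimates into one decision.

    Each p_* is the probability, from a separate per-modality classifier,
    that the current turn addresses the system (On-Talk).
    Returns (fused_score, is_on_talk).
    """
    w_a, w_l, w_v = weights
    # Weighted sum of the three posteriors; weights sum to 1.
    fused = w_a * p_acoustic + w_l * p_linguistic + w_v * p_video
    return fused, fused >= threshold

# Example: acoustics and linguistics suggest On-Talk, video is undecided.
score, on_talk = fuse_on_talk_scores(0.8, 0.7, 0.4)
# score = 0.4*0.8 + 0.3*0.7 + 0.3*0.4 = 0.65, so the turn is labelled On-Talk.
```

In practice, such fusion weights would be tuned on held-out data rather than fixed by hand.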
Bibliographic reference. Batliner, Anton / Hacker, Christian / Kaiser, Moritz / Mögele, Hannes / Nöth, Elmar (2007): "Taking into account the user's focus of attention with the help of audio-visual information: towards less artificial human-machine-communication", In AVSP-2007, paper P15.