This paper presents a knowledge-based method for early-stage, high-level multimodal fusion of data obtained from speech input and the visual scene. The ultimate goal is to develop a human-computer multimodal interface that assists elderly people living alone at home in performing their daily activities and supports their active ageing and social cohesion. Crucial for high-level multimodal fusion and successful communication is the provision of extensive semantic and contextual information from spoken language understanding. To address this, we propose to extract natural language semantic representations and map them onto a restricted domain ontology. This information is then processed for multimodal reference resolution together with the visual scene input. To make our approach flexible and widely applicable, the a priori situational knowledge, the modalities and the fusion process are modelled in an ontology expressing the domain constraints. Here we illustrate ontology-based multimodal fusion on an example scenario combining speech and visual scene analysis.
Bibliographic reference. Vybornova, Olga / Gemo, Monica / Moncarey, Ronald / Macq, Benoit (2007): "Ontology-based multimodal high level fusion involving natural language analysis for aged people home care application", in Proceedings of INTERSPEECH 2007, pp. 2577-2580.