As mobile devices, intelligent displays, and home entertainment systems permeate digital markets, users increasingly expect to interact through spoken and visual modalities. Previous interactive systems have limited voice activity detection (VAD) to the acoustic domain alone, but incorporating visual features has been shown to substantially improve accuracy. When employing both acoustic and visual (AV) information, the central recurring question becomes "how does one efficiently fuse the modalities?" This work investigates the effects of different features (from multiple modalities), different classifiers, and different fusion techniques for the task of AV-VAD on data with varying levels of acoustic noise. Furthermore, we present a novel multi-tier classifier that combines the traditional approaches of feature fusion and decision fusion: independent per-modality classifiers produce intermediary decisions, which are then combined with the raw features as inputs to a second-stage classifier. Our augmented multi-tier classification system concatenates the outputs of a set of base classifiers with the original fused features for a final classifier. Experiments over various noise conditions show average relative improvements of 5.0% and 4.1% on the CUAVE dataset and 2.6% and 11.1% on the MOBIO dataset over majority voting and LDA, respectively.
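The multi-tier scheme described above can be sketched as follows. This is a minimal illustration, not the paper's actual system: the per-modality base classifiers, feature dimensions, thresholds, and second-stage weights are all hypothetical placeholders, and the second stage here is a simple linear classifier standing in for whatever final classifier is trained on the augmented vector.

```python
# Sketch of multi-tier AV-VAD fusion: tier-1 base classifiers score each
# modality independently; their outputs are concatenated with the original
# fused (audio + visual) features and passed to a tier-2 classifier.
# All features, thresholds, and weights below are illustrative only.

def audio_base_score(audio_feats):
    # Hypothetical acoustic base classifier: mean energy vs. a fixed threshold.
    return 1.0 if sum(audio_feats) / len(audio_feats) > 0.5 else 0.0

def visual_base_score(visual_feats):
    # Hypothetical visual base classifier: peak mouth motion vs. a threshold.
    return 1.0 if max(visual_feats) > 0.3 else 0.0

def multi_tier_vad(audio_feats, visual_feats, weights, bias):
    # Tier 1: independent per-modality soft decisions.
    base_scores = [audio_base_score(audio_feats), visual_base_score(visual_feats)]
    # Augmentation: base-classifier outputs + original fused feature vector.
    augmented = base_scores + list(audio_feats) + list(visual_feats)
    # Tier 2: linear second-stage classifier over the augmented vector.
    activation = sum(w * x for w, x in zip(weights, augmented)) + bias
    return 1 if activation > 0 else 0  # 1 = speech, 0 = non-speech

# Example frame: 3 audio features, 2 visual features -> augmented length 7.
audio = [0.8, 0.6, 0.7]
visual = [0.4, 0.1]
weights = [1.0, 1.0, 0.2, 0.2, 0.2, 0.3, 0.3]
print(multi_tier_vad(audio, visual, weights, bias=-1.5))  # -> 1 (speech)
```

In this sketch the tier-2 classifier sees both the hard/soft decisions of the base classifiers and the raw features, so it can learn to override a base classifier when one modality is unreliable (e.g., acoustic noise), which is the motivation for augmenting decision fusion with the original features.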
Bibliographic reference. Burlick, Matt / Dimitriadis, Dimitrios / Zavesky, Eric (2013): "On the improvement of multimodal voice activity detection", In INTERSPEECH-2013, 685-689.