ITRW on
Adaptation Methods for Speech Recognition

August 29-30, 2001
Sophia Antipolis, France

Dominant Speaker Detection Based on Voicing for Adaptive Audio-Visual ASR Robust to Speech Noise

Hervé Glotin

ICP, Institut de la Communication Parlée, Grenoble, France and
IDIAP, Institut Dalle Molle d’Intelligence Artificielle, EPFL, Switzerland

We investigate the use of voicing in state-of-the-art Large Vocabulary Continuous Audio-Visual automatic Speech Recognition (AV-LVCSR). We apply an original adaptive weighting function that uses the voicing level to estimate the appropriate combination weight for each modality. We show that state-of-the-art AV-LVCSR performance under speech noise can be improved by using a dominant speaker detector that is a function of the voicing level. We refine the weighting function according to the sensitivity and specificity of the dominant speaker detector. In this first experiment, the weighting functions are threshold functions of the voicing level. Rather than testing all possible thresholds, three are arbitrarily chosen: one at which the sensitivity of the detector reaches 95%, one at which its specificity reaches 95%, and one at which sensitivity and specificity are equal. Results show that the AV-LVCSR system we use is improved by 5.7% using a weighting function with high sensitivity to dominant speaker activity.
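The threshold selection described in the abstract can be illustrated with a minimal sketch. It is not the author's implementation: the function names, the synthetic voicing scores, the direction of the weighting (more audio weight when the detector fires) and the weight values 0.7 and 0.3 are all assumptions made for illustration only.

```python
import numpy as np

def sens_spec(voicing, is_dominant, threshold):
    """Sensitivity and specificity of the detector `voicing >= threshold`."""
    detected = voicing >= threshold
    sensitivity = np.mean(detected[is_dominant])     # detected among dominant frames
    specificity = np.mean(~detected[~is_dominant])   # rejected among the other frames
    return sensitivity, specificity

def pick_thresholds(voicing, is_dominant, target=0.95):
    """Pick three operating points: high sensitivity, high specificity, equal rates."""
    # Candidate thresholds are the observed voicing values, plus +inf so that
    # a threshold with specificity 1.0 always exists.
    candidates = np.append(np.unique(voicing), np.inf)
    rates = np.array([sens_spec(voicing, is_dominant, t) for t in candidates])
    sens, spec = rates[:, 0], rates[:, 1]
    return {
        # strictest threshold that still reaches the target sensitivity
        "high_sensitivity": candidates[sens >= target].max(),
        # most permissive threshold that reaches the target specificity
        "high_specificity": candidates[spec >= target].min(),
        # threshold where sensitivity and specificity are closest
        "equal_rate": candidates[np.argmin(np.abs(sens - spec))],
    }

def audio_weight(voicing, threshold, w_detected=0.7, w_other=0.3):
    """Threshold weighting function; the two weight values are hypothetical."""
    return np.where(voicing >= threshold, w_detected, w_other)

# Synthetic, labelled voicing levels (assumption: dominant frames are more voiced).
rng = np.random.default_rng(0)
is_dominant = np.repeat([True, False], 500)
voicing = np.where(is_dominant,
                   rng.normal(0.7, 0.15, 1000),
                   rng.normal(0.3, 0.15, 1000))

thresholds = pick_thresholds(voicing, is_dominant)
weights = audio_weight(voicing, thresholds["high_sensitivity"])
```

In this sketch the visual stream weight would be taken as one minus the audio weight; how the actual system combines the two streams is not specified in the abstract.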


Bibliographic reference.  Glotin, Hervé (2001): "Dominant speaker detection based on voicing for adaptive audio-visual ASR robust to speech noise", In Adaptation-2001, 89-92.