Auditory-Visual Speech Processing 2007 (AVSP2007)

Kasteel Groenendaal, Hilvarenbeek, The Netherlands
August 31 - September 3, 2007

Development and Comparison of Two Approaches for Visual Speech Analysis with Application to Voice Activity Detection

Bertrand Rivet (1,2), Andrew Aubrey (3), Laurent Girin (1), Yulia Hicks (3), Christian Jutten (2), Jonathon Chambers (3)

(1,2) Grenoble Image Parole Signal Automatique, Grenoble Institute of Technology (INPG), Grenoble, France
(3) Centre of Digital Signal Processing, Cardiff School of Engineering, Cardiff University, UK

In this paper we present two novel methods for visual voice activity detection (V-VAD) that exploit the bimodality of speech (i.e. the coherence between a speaker's lips and the resulting speech). The first method uses appearance parameters of the speaker's lips, obtained from an active appearance model (AAM); a hidden Markov model (HMM) then models the change in appearance over time. The second method applies a retinal filter to the lip region to extract the required parameter. Each method is applied in turn to a corpus of a single speaker and used to classify voice activity as speech or non-speech. The efficiency of each method is evaluated individually using receiver operating characteristics (ROC), and their respective performances are then compared and discussed. Both methods achieve a high correct silence detection rate for a small false detection rate.
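
The ROC evaluation mentioned above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `roc_points`, the rule that a frame is declared silent when its score is at or below a threshold, and the toy scores and labels are all assumptions made for illustration. Each threshold yields one operating point, pairing the false silence detection rate with the correct silence detection rate.

```python
def roc_points(scores, is_silence, thresholds):
    """Return one (false_detection_rate, correct_detection_rate) pair per threshold.

    Assumption (hypothetical rule): a frame is declared silent when its
    visual score is less than or equal to the threshold.
    """
    n_silence = sum(1 for s in is_silence if s)
    n_speech = len(is_silence) - n_silence
    if n_silence == 0 or n_speech == 0:
        raise ValueError("need at least one silence frame and one speech frame")

    points = []
    for t in thresholds:
        detected = [x <= t for x in scores]
        # Silence frames correctly declared silent.
        correct = sum(1 for d, s in zip(detected, is_silence) if d and s)
        # Speech frames wrongly declared silent.
        false = sum(1 for d, s in zip(detected, is_silence) if d and not s)
        points.append((false / n_speech, correct / n_silence))
    return points


# Toy data (invented for illustration): low scores stand for little lip activity.
scores = [0.1, 0.2, 0.9, 0.8, 0.15, 0.7, 0.3, 0.05]
is_silence = [True, True, False, False, True, False, False, True]
points = roc_points(scores, is_silence, [0.0, 0.2, 0.5, 1.0])
for false_rate, correct_rate in points:
    print(f"false detection rate {false_rate:.2f}, correct detection rate {correct_rate:.2f}")
```

Sweeping the threshold from low to high traces the curve from (0, 0) to (1, 1); a detector performs well when the correct silence detection rate rises quickly while the false detection rate stays small.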

Bibliographic reference.  Rivet, Bertrand / Aubrey, Andrew / Girin, Laurent / Hicks, Yulia / Jutten, Christian / Chambers, Jonathon (2007): "Development and comparison of two approaches for visual speech analysis with application to voice activity detection", In AVSP-2007, paper P14.