Auditory-Visual Speech Processing 2007 (AVSP2007)

Kasteel Groenendaal, Hilvarenbeek, The Netherlands
August 31 - September 3, 2007

Audiovisual Speech Source Separation: A Regularization Method based on Visual Voice Activity Detection

Bertrand Rivet (1,2), Laurent Girin (1), Christine Servière (2), Dinh-Tuan Pham (3), Christian Jutten (2)

(1,2) Grenoble Image Parole Signal Automatique [GIPSA - (1)ICP/(2)LIS], Grenoble Institute of Technology (INPG), France
(3) Laboratoire Jean Kuntzmann, Grenoble Institute of Technology (INPG), Université Joseph Fourier, Grenoble, France

Audio-visual speech source separation consists in combining visual speech processing techniques (e.g. lip parameter tracking) with source separation methods to improve and/or simplify the extraction of a speech signal from a mixture of acoustic signals. In this paper, we present a new approach to this problem: visual information is used as a voice activity detector (VAD). Results show that, in the difficult case of realistic convolutive mixtures, the classic problem of the permutation of the output frequency channels can be solved using the visual information, with simpler processing than when only audio information is used.
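The permutation problem mentioned above arises because frequency-domain separation recovers the sources independently in each frequency bin, in an arbitrary order per bin. The following sketch illustrates (it is not the authors' algorithm) one way a visual VAD signal could resolve this ambiguity: in each bin, the output whose energy envelope best correlates with the target speaker's voice-activity profile is taken as the target channel. All array shapes and the function name are illustrative assumptions.

```python
import numpy as np

def resolve_permutations(sources_f, vad):
    """Illustrative sketch (not the paper's method): fix the per-bin
    permutation ambiguity of frequency-domain source separation by
    aligning each bin's output energy envelopes with a visual
    voice-activity signal for the target speaker.

    sources_f : array (n_bins, n_sources, n_frames), separated STFT outputs
    vad       : array (n_frames,), voice-activity profile of the target
    Returns an array of the same shape with the target speaker first.
    """
    n_bins, n_sources, _ = sources_f.shape
    aligned = np.empty_like(sources_f)
    for f in range(n_bins):
        env = np.abs(sources_f[f]) ** 2          # energy envelope per output
        # correlate each output's envelope with the VAD profile
        scores = [np.corrcoef(env[s], vad)[0, 1] for s in range(n_sources)]
        target = int(np.argmax(scores))          # output matching the VAD best
        order = [target] + [s for s in range(n_sources) if s != target]
        aligned[f] = sources_f[f, order]         # target channel placed first
    return aligned
```

The design intuition is that the target's energy should be high exactly when the lips indicate speech activity, which is the kind of regularization the visual VAD provides.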


Bibliographic reference.  Rivet, Bertrand / Girin, Laurent / Servière, Christine / Pham, Dinh-Tuan / Jutten, Christian (2007): "Audiovisual speech source separation: a regularization method based on visual voice activity detection", In AVSP-2007, paper P07.