10th Annual Conference of the International Speech Communication Association

Brighton, United Kingdom
September 6-10, 2009

Robust Audio-Visual Speech Synchrony Detection by Generalized Bimodal Linear Prediction

Kshitiz Kumar (1), Jiri Navratil (2), Etienne Marcheret (2), Vit Libal (2), Gerasimos Potamianos (3)

(1) Carnegie Mellon University, USA
(2) IBM T.J. Watson Research Center, USA
(3) NCSR “Demokritos”, Greece

We study the problem of detecting audio-visual synchrony in video segments containing a speaker in frontal head pose. The problem has a number of important applications, for example speech source localization, speech activity detection, speaker diarization, speech source separation, and biometric spoofing detection. In particular, we build on earlier work, extending our previously proposed time-evolution model of audio-visual features to include non-causal (future) feature information. As our experiments demonstrate, this significantly improves the robustness of the method to small time-alignment errors between the audio and visual streams. In addition, we compare the proposed model to two known approaches from the literature for audio-visual synchrony detection, namely mutual information and hypothesis testing, and we show that our method is superior to both.
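
The abstract does not state the model equations, so the following is only an illustrative sketch of the general idea, not the authors' formulation. It assumes that synchrony is scored by how well a window of visual features linearly predicts the audio features, and that "non-causal" means the window also covers future frames. The function names, window sizes, and synthetic data are all hypothetical.

```python
import numpy as np


def stack_context(v, past, future):
    """Stack visual frames v[t-past .. t+future] into one regressor row per valid t."""
    n_frames = v.shape[0]
    rows = [v[t - past:t + future + 1].ravel() for t in range(past, n_frames - future)]
    return np.asarray(rows)


def prediction_residual(a, v, past=2, future=2):
    """Normalized residual energy of predicting audio features from visual context.

    future=0 gives a causal predictor; future>0 adds non-causal (future) frames.
    Values near 0 suggest the streams are linearly related; values near 1 suggest not.
    (Assumed scoring rule for illustration only.)
    """
    x = stack_context(v, past, future)
    x = np.hstack([x, np.ones((x.shape[0], 1))])   # bias column
    y = a[past:a.shape[0] - future]                # audio frames aligned with the rows of x
    w, *_ = np.linalg.lstsq(x, y, rcond=None)      # least-squares predictor weights
    resid = y - x @ w
    return float(np.sum(resid ** 2) / np.sum((y - y.mean(axis=0)) ** 2))


# Synthetic data (hypothetical): audio is a noisy linear function of the visual features.
rng = np.random.default_rng(0)
n_frames = 400
v = rng.standard_normal((n_frames, 3))
mix = np.array([[1.0, 0.5], [-0.5, 1.0], [0.3, -0.7]])
a_sync = v @ mix + 0.1 * rng.standard_normal((n_frames, 2))

# Small alignment error: the audio stream leads the visual stream by one frame.
a_shifted = np.roll(a_sync, -1, axis=0)

# Unrelated audio stands in for an asynchronous (e.g. off-screen) speaker.
a_unrelated = rng.standard_normal((n_frames, 2))

res_causal = prediction_residual(a_shifted, v, past=2, future=0)
res_noncausal = prediction_residual(a_shifted, v, past=2, future=2)
res_unrelated = prediction_residual(a_unrelated, v, past=2, future=2)
```

On this toy data the causal predictor cannot explain the one-frame-shifted audio, while the predictor with future context can, and unrelated audio stays poorly predicted in both cases. This mirrors, in a simplified way, the robustness to small time-alignment errors that the abstract attributes to including non-causal information.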

Bibliographic reference. Kumar, Kshitiz / Navratil, Jiri / Marcheret, Etienne / Libal, Vit / Potamianos, Gerasimos (2009): "Robust audio-visual speech synchrony detection by generalized bimodal linear prediction", in INTERSPEECH-2009, 2251-2254.