ISCA Archive AVSP 2013

Visual voice activity detection at different speeds

Bart Joosten, Eric Postma, Emiel Krahmer

Visual Voice Activity Detection (VVAD) refers to the detection of speech from a video sequence by means of visual cues. VVAD provides a useful addition to auditory voice activity detection, in particular in cases involving multiple speakers or background noise. This paper focuses on measuring facial movements at different speeds to determine which rates of movement contribute to VVAD. Facial movements in video sequences of talking faces are measured using a spatiotemporal Gabor transform. VVAD performance based on these measurements is determined for different speeds and compared to that of simple frame differencing. In addition, performance is assessed for the entire frame, the head region, and the mouth region. The results reveal higher VVAD performance at high speeds than at low speeds. Moreover, frame differencing performs at a level comparable to that of the spatiotemporal Gabor method at its optimal speeds.
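
The frame-differencing baseline mentioned above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the use of the mean absolute difference as the movement measure, and the fixed threshold are all assumptions made for illustration, and the abstract does not specify how the measurements are turned into a speech/non-speech decision.

```python
import numpy as np


def frame_difference_energy(frames):
    """Mean absolute intensity change between consecutive frames.

    frames: grayscale video as an array of shape (T, H, W).
    Returns an array of length T - 1, one value per frame transition.
    (Hypothetical helper; the movement measure is an assumption.)
    """
    frames = np.asarray(frames, dtype=np.float64)
    # Difference along the time axis, then average over all pixels.
    diffs = np.abs(np.diff(frames, axis=0))
    return diffs.mean(axis=(1, 2))


def detect_activity(frames, threshold):
    """Flag frame transitions whose movement energy exceeds a threshold.

    The fixed threshold is an illustrative stand-in for whatever
    classifier a real VVAD system would use.
    """
    return frame_difference_energy(frames) > threshold
```

To restrict the measurement to a region such as the head or the mouth, the same function could be applied to a cropped array, for example `frames[:, top:bottom, left:right]`, where the crop coordinates would come from a separate face or mouth detector.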

Index Terms: visual active speech, frame differencing, Gabor transform, spatiotemporal Gabor transform

Cite as: Joosten, B., Postma, E., Krahmer, E. (2013) Visual voice activity detection at different speeds. Proc. Auditory-Visual Speech Processing, 187-190

  author={Bart Joosten and Eric Postma and Emiel Krahmer},
  title={{Visual voice activity detection at different speeds}},
  booktitle={Proc. Auditory-Visual Speech Processing},