Biennial Workshop on Digital Signal Processing for In-Vehicle and Mobile Systems

Sesimbra, Portugal
September 2-3, 2005

Use of Lip Information for Robust Speaker Identification and Speech Recognition

Ertan Cetingul, Engin Erzin, Yücel Yemez, A. Murat Tekalp

College of Engineering, Koç University, Sariyer, Istanbul, Turkey

This study investigates the benefits of multimodal fusion of the audio, lip motion and lip texture modalities for speaker identification and speech recognition. The audio modality is represented by the well-known mel-frequency cepstral coefficients (MFCC) along with their first and second derivatives, whereas the lip texture modality is represented by the 2D-DCT coefficients of the luminance component within a bounding box around the lip region. A new lip motion representation, based on discriminative analysis of the dense motion vectors within the same bounding box, is employed for speaker/speech recognition. The three modalities are fused by the Reliability Weighted Summation (RWS) decision rule. Experimental results show that including the lip motion and lip texture modalities provides further performance gains in both speaker identification and speech recognition scenarios.
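
As a rough illustration of two steps named in the abstract, the sketch below computes 2D-DCT features from a luminance block and fuses per-class scores from several modalities by a reliability-weighted sum. The abstract does not give the details, so everything specific here is an assumption made for illustration: the function names (dct2, lip_texture_features, rws_fuse), the rule that keeps coefficients with u + v < keep, the min-max score normalization, and the reliability measure (the gap between the best and second-best normalized score). It should not be read as the authors' implementation.

```python
import math


def dct2(block):
    """Orthonormal 2D DCT-II of a rectangular block given as a list of rows."""
    rows, cols = len(block), len(block[0])

    def basis(n):
        # n x n orthonormal DCT-II matrix: C[k][i] = a(k) * cos(pi * (2i + 1) * k / (2n))
        return [[(math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n))
                 * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                 for i in range(n)] for k in range(n)]

    cr, cc = basis(rows), basis(cols)
    # Transform each row first, then transform the result along the columns.
    tmp = [[sum(block[r][j] * cc[v][j] for j in range(cols)) for v in range(cols)]
           for r in range(rows)]
    return [[sum(cr[u][r] * tmp[r][v] for r in range(rows)) for v in range(cols)]
            for u in range(rows)]


def lip_texture_features(luma_box, keep=3):
    """Low-frequency 2D-DCT coefficients of a lip bounding box.

    Assumption: coefficients with u + v < keep are retained; the abstract
    does not say which coefficients are used.
    """
    coeffs = dct2(luma_box)
    return [coeffs[u][v]
            for u in range(len(coeffs))
            for v in range(len(coeffs[0]))
            if u + v < keep]


def rws_fuse(scores_by_modality):
    """Reliability-weighted sum of per-class scores (needs at least two classes).

    scores_by_modality maps a modality name to a list of per-class scores,
    for example log-likelihoods. Returns (fused_scores, weights).
    """
    normalized, reliability = {}, {}
    for name, scores in scores_by_modality.items():
        lo, hi = min(scores), max(scores)
        span = (hi - lo) or 1.0  # avoid division by zero for flat scores
        norm = [(s - lo) / span for s in scores]
        top, second = sorted(norm, reverse=True)[:2]
        normalized[name] = norm
        # Assumed reliability: how clearly the best class stands out.
        reliability[name] = top - second

    total = sum(reliability.values())
    count = len(reliability)
    # Weights sum to one; fall back to equal weights if no modality is decisive.
    weights = {name: (r / total if total > 0 else 1.0 / count)
               for name, r in reliability.items()}

    n_classes = len(next(iter(normalized.values())))
    fused = [sum(weights[name] * normalized[name][c] for name in normalized)
             for c in range(n_classes)]
    return fused, weights


# Made-up scores for three candidate speakers, one list per modality.
example_scores = {
    "audio": [-10.0, -12.0, -11.0],
    "lip_texture": [-5.0, -4.9, -5.2],
    "lip_motion": [-3.0, -8.0, -7.0],
}
fused_scores, modality_weights = rws_fuse(example_scores)
decision = max(range(len(fused_scores)), key=fused_scores.__getitem__)
```

With this weighting, a modality whose scores separate the best class clearly from the runner-up contributes more to the fused decision, which is the general idea behind weighting modalities by reliability.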

Bibliographic reference.  Cetingul, Ertan / Erzin, Engin / Yemez, Yücel / Tekalp, A. Murat (2005): "Use of lip information for robust speaker identification and speech recognition", In DSP-in-V-2005, paper M1-4 (abstract).