INTERSPEECH 2007
8th Annual Conference of the International Speech Communication Association

Antwerp, Belgium
August 27-31, 2007

A Unified Approach to Multi-Pose Audio-Visual ASR

Patrick Lucey (1), Gerasimos Potamianos (2), Sridha Sridharan (1)

(1) Queensland University of Technology, Australia
(2) IBM T.J. Watson Research Center, USA

The vast majority of studies in the field of audio-visual automatic speech recognition (AVASR) assume frontal images of a speaker's face, but this cannot always be guaranteed in practice. Hence our recent research efforts have concentrated on extracting visual speech information from non-frontal faces, in particular the profile view. The introduction of additional views to an AVASR system increases its complexity, as it has to deal with the different visual features associated with the various views. In this paper, we propose the use of linear regression to find a transformation matrix, estimated from synchronous frontal and profile visual speech data, which is used to normalize the visual speech in each viewpoint into a single uniform view. In our experiments on the task of multi-speaker lipreading, we show that this "pose-invariant" technique reduces train/test mismatch between visual speech features of different views, and is of particular benefit when there is more training data for one viewpoint than for another (e.g., frontal over profile).
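The paper itself provides no code; the following is a minimal sketch of the core idea under stated assumptions. Given paired, frame-synchronous feature matrices from the two views (the variable names, feature dimensions, and random placeholder data below are purely illustrative), a least-squares linear regression estimates a transformation matrix that maps profile-view features into the frontal feature space, after which a single-view recognizer can be used for both poses.

import numpy as np

# Hypothetical synchronous visual speech features: one row per video frame.
# X_profile: features extracted from the profile view (N frames x Dp dims)
# X_frontal: features from the frontal view of the same frames (N x Df dims)
rng = np.random.default_rng(0)
X_profile = rng.standard_normal((500, 40))
X_frontal = rng.standard_normal((500, 40))

# Append a bias column so the learned map is affine:
# x_frontal ~ W^T [x_profile; 1].
X_aug = np.hstack([X_profile, np.ones((X_profile.shape[0], 1))])

# Least-squares estimate of the transformation matrix W ((Dp+1) x Df),
# minimizing ||X_aug @ W - X_frontal||_F^2 over the paired training frames.
W, *_ = np.linalg.lstsq(X_aug, X_frontal, rcond=None)

# Normalize new profile-view features into the "frontal" feature space
# before passing them to the lipreading / AVASR back end.
def to_frontal(profile_feats: np.ndarray) -> np.ndarray:
    aug = np.hstack([profile_feats, np.ones((profile_feats.shape[0], 1))])
    return aug @ W

In an actual system, W would of course be trained on real two-view recordings of the same utterances rather than random data; the affine (bias-augmented) form is one common choice for such a regression, not necessarily the exact formulation used by the authors.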


Bibliographic reference. Lucey, Patrick / Potamianos, Gerasimos / Sridharan, Sridha (2007): "A unified approach to multi-pose audio-visual ASR", in INTERSPEECH-2007, pp. 650-653.