ESCA Workshop on Audio-Visual Speech Processing (AVSP'97)

September 26-27, 1997
Rhodes, Greece

Speaker Independent Audio-Visual Database for Bimodal ASR

Gerasimos Potamianos (1), Eric Cosatto (2), Hans Peter Graf (2), David B. Roe (1)

(1) AT&T Labs-Research, Florham Park, NJ, USA
(2) AT&T Labs-Research, Red Bank, NJ, USA

This paper describes the audio-visual database collected at AT&T Labs-Research for the study of bimodal speech recognition. To date, the database consists of two multi-speaker parts, isolated confusable words and connected letters, which allow the study of some popular and relatively simple speaker-independent audio-visual recognition tasks. In addition, a single-speaker connected-digits database has been collected to facilitate rapid development and testing of various algorithms. Intentionally, no lip markings were used on the subjects during data collection. Robust, speaker-independent algorithms for mouth location and lip contour extraction are therefore needed to obtain informative visual speech features (the visual front end). We describe our approach to this problem and report our automatic speech-reading and audio-visual speech recognition results on the single-speaker connected-digits task.

Bibliographic reference. Potamianos, Gerasimos / Cosatto, Eric / Graf, Hans Peter / Roe, David B. (1997): "Speaker independent audio-visual database for bimodal ASR", in AVSP-1997, 65-68.