Auditory-Visual Speech Processing (AVSP) 2010

Hakone, Kanagawa, Japan
September 30-October 3, 2010

Limitations of Visual Speech Recognition

Jacob L. Newman, Barry-John Theobald, Stephen J. Cox

School of Computer Sciences, University of East Anglia, Norwich, UK

In this paper we investigate the limits of automated lip-reading systems and consider the improvement that could be gained if additional information from other (non-visible) speech articulators were available to the recogniser. Hidden Markov model (HMM) speech recognisers are trained using electromagnetic articulography (EMA) data drawn from the MOCHA-TIMIT data set. Articulatory information is systematically withheld from the recogniser, and its performance is tested and compared with that of a typical state-of-the-art lip-reading system. We find that, as expected, the performance of the recogniser degrades as articulatory information is lost, and that a typical lip-reading system achieves a level of performance similar to that of an EMA-based recogniser that uses information from only the front of the tongue forwards. Our results show that there is significant information in the articulator positions towards the back of the mouth that could be exploited were it available, but even this is insufficient to match the performance of an acoustic speech recogniser.

Index Terms: automated lip-reading, visual speech recognition, articulatory analysis
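
The idea of systematically withholding articulatory information can be illustrated with a minimal sketch. It is not the authors' procedure or feature set: the articulator names, their front-to-back ordering, the (x, y) channel layout, and the helper functions are all assumptions made for illustration. Each reduced feature set would then be used to train and test a separate recogniser.

```python
# Hypothetical channel layout: each articulator contributes an (x, y) pair.
# The names and the front-to-back ordering are assumptions for illustration.
ARTICULATORS = ["upper_lip", "lower_lip", "lower_incisor",
                "tongue_tip", "tongue_body", "tongue_dorsum", "velum"]


def channel_indices(kept):
    """Return the feature-column indices for the articulators in `kept`."""
    idx = []
    for name in kept:
        pos = ARTICULATORS.index(name)
        idx.extend([2 * pos, 2 * pos + 1])  # x column, then y column
    return idx


def withhold(frames, kept):
    """Keep only the columns belonging to `kept` in every frame."""
    idx = channel_indices(kept)
    return [[frame[i] for i in idx] for frame in frames]


def ablation_subsets():
    """Yield progressively smaller subsets, dropping the rearmost articulator each time."""
    for n in range(len(ARTICULATORS), 0, -1):
        yield ARTICULATORS[:n]
```

Iterating over `ablation_subsets()` and passing each subset to `withhold` gives a sequence of feature sets that move from the full articulatory description towards lips-only information.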


Bibliographic reference. Newman, Jacob L. / Theobald, Barry-John / Cox, Stephen J. (2010): "Limitations of visual speech recognition", in AVSP-2010, paper P1.