In this paper, we first present a shape and appearance model for Audio-Visual Automatic Speech Recognition. The shape model consists of a template (the mean shape) and a set of deformation vectors that transform it into any possible shape. The global appearance model is a neural network trained to classify 5×5 colour image blocks as belonging to skin, lips, or the inside of the mouth. Both parts of this model were built automatically, without hand-labelling. The appearance model was built using speech bimodality (acoustic information). We then propose several measures for evaluating the quality of lip location. Finally, we show the classification results obtained using one hand-labelled and two automatically built appearance models of the lips.
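As an illustration of the shape model described above, a template plus deformation vectors can be read as a weighted sum: each deformation vector is scaled by a weight and added to the mean shape. The sketch below is a minimal example under that assumption, not the authors' implementation. The function name `deform_shape`, the four-point contour, and the two deformation modes (`OPEN_MOUTH`, `WIDEN`) are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def deform_shape(
    mean_shape: Sequence[Point],
    deformation_vectors: Sequence[Sequence[Point]],
    weights: Sequence[float],
) -> List[Point]:
    """Return mean_shape + sum_k weights[k] * deformation_vectors[k].

    Assumes a linear combination of deformation vectors; the abstract
    does not specify how the vectors are combined.
    """
    if len(deformation_vectors) != len(weights):
        raise ValueError("need exactly one weight per deformation vector")

    # Start from a mutable copy of the template (mean shape).
    shape = [[x, y] for x, y in mean_shape]

    for vector, weight in zip(deformation_vectors, weights):
        if len(vector) != len(mean_shape):
            raise ValueError("deformation vector must have one offset per point")
        # Add the weighted per-point offsets to the current shape.
        for i, (dx, dy) in enumerate(vector):
            shape[i][0] += weight * dx
            shape[i][1] += weight * dy

    return [(x, y) for x, y in shape]


# Hypothetical 4-point lip contour: left corner, top, right corner, bottom.
MEAN_SHAPE = [(-2.0, 0.0), (0.0, 1.0), (2.0, 0.0), (0.0, -1.0)]

# Hypothetical deformation modes (illustrative values only).
OPEN_MOUTH = [(0.0, 0.0), (0.0, 1.0), (0.0, 0.0), (0.0, -1.0)]
WIDEN = [(-1.0, 0.0), (0.0, 0.0), (1.0, 0.0), (0.0, 0.0)]

if __name__ == "__main__":
    for point in deform_shape(MEAN_SHAPE, [OPEN_MOUTH, WIDEN], [0.5, 1.0]):
        print(point)
```

With all weights set to zero the function returns the template unchanged, which matches the role of the mean shape as the starting point for every deformation.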
Cite as: Daubias, P., Deleglise, P. (2001) Evaluation of an automatically obtained shape and appearance model for automatic audio visual speech recognition. Proc. 7th European Conference on Speech Communication and Technology (Eurospeech 2001), 1031-1034, doi: 10.21437/Eurospeech.2001-295
@inproceedings{daubias01_eurospeech,
  author={Philippe Daubias and Paul Deleglise},
  title={{Evaluation of an automatically obtained shape and appearance model for automatic audio visual speech recognition}},
  year=2001,
  booktitle={Proc. 7th European Conference on Speech Communication and Technology (Eurospeech 2001)},
  pages={1031--1034},
  doi={10.21437/Eurospeech.2001-295}
}