Auditory-Visual Speech Processing (AVSP) 2009
University of East Anglia, Norwich, UK
We extract relevant and informative audio-visual features using multiple multi-class Support Vector Machines with probabilistic outputs, and demonstrate the approach in a noisy audio-visual speech-reading scenario. We first extract visual spatio-temporal features and audio cepstral coefficients from pronounced digit sequences. Two classifiers are then trained, one per modality, to obtain confidence factors that are used to select the most appropriate fusion strategy. A final classifier is trained on the joint audio-visual feature space and used to recognize digits. We demonstrate the proposed approach on a standard database and compare it with alternative methods. The evaluation shows that the proposed approach outperforms the alternatives in both recognition accuracy and robustness.
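The pipeline sketched in the abstract (per-modality probabilistic SVMs, confidence factors, fusion-strategy selection, and a final classifier on the joint feature space) can be illustrated with a minimal toy sketch. This is not the authors' code: it assumes scikit-learn's Platt-scaled `SVC` as the probabilistic multi-class SVM, synthetic features in place of the real spatio-temporal and cepstral descriptors, and a hypothetical threshold rule for choosing between early (joint-space) and confidence-weighted late fusion.

```python
# Hypothetical sketch of confidence-gated audio-visual fusion with
# probabilistic multi-class SVMs (scikit-learn stands in for the paper's setup).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-ins for audio cepstral and visual spatio-temporal features
# over 10 digit classes.
n, d_audio, d_visual, n_classes = 300, 13, 20, 10
y = rng.integers(0, n_classes, n)
X_audio = rng.normal(y[:, None], 2.0, (n, d_audio))
X_visual = rng.normal(y[:, None], 2.0, (n, d_visual))

# One multi-class SVM per modality, with probabilistic (Platt-scaled) outputs.
svm_a = SVC(probability=True, random_state=0).fit(X_audio, y)
svm_v = SVC(probability=True, random_state=0).fit(X_visual, y)

# Confidence factor per modality: mean maximum posterior probability.
p_a = svm_a.predict_proba(X_audio)
p_v = svm_v.predict_proba(X_visual)
conf_a, conf_v = p_a.max(axis=1).mean(), p_v.max(axis=1).mean()

# Hypothetical fusion rule: if both modalities are confident, train the final
# classifier on the concatenated (joint) feature space; otherwise weight the
# unimodal posteriors by their confidence factors (late fusion).
threshold = 0.5
if conf_a > threshold and conf_v > threshold:
    X_joint = np.hstack([X_audio, X_visual])
    svm_av = SVC(probability=True, random_state=0).fit(X_joint, y)
    pred = svm_av.predict(X_joint)
else:
    pred = np.argmax(conf_a * p_a + conf_v * p_v, axis=1)

print(f"training accuracy: {(pred == y).mean():.2f}")
```

The threshold value and the confidence measure (mean maximum posterior) are illustrative choices only; the paper's actual criterion for selecting the fusion strategy is not reproduced here.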
Bibliographic reference. Pachoud, Samuel / Gong, Shaogang / Cavallaro, Andrea (2009): "Space-time audio-visual speech recognition with multiple multi-class probabilistic support vector machines", In AVSP-2009, 155-160.