14th Annual Conference of the International Speech Communication Association

Lyon, France
August 25-29, 2013

Using Twin-HMM-Based Audio-Visual Speech Enhancement as a Front-End for Robust Audio-Visual Speech Recognition

Ahmed Hussen Abdelaziz, Steffen Zeiler, Dorothea Kolossa

Ruhr-Universität Bochum, Germany

In this paper, we propose using the recently introduced twin-HMM-based audio-visual speech enhancement algorithm as a front-end for audio-visual speech recognition systems. The algorithm estimates the clean speech statistics in the recognition domain from the audio-visual observations and transforms these statistics to the synthesis domain through so-called twin HMMs. The front-end is combined with back-end methods such as conventional maximum likelihood decoding or the newly introduced significance decoding. Applied to acoustically corrupted signals of the Grid audio-visual corpus, the proposed combination of front-end and back-end yields statistically significant improvements in audio-visual recognition accuracy compared to the ETSI advanced front-end.
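
The recognition-to-synthesis mapping described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each HMM state carries one diagonal-covariance Gaussian over recognition-domain (audio-visual) features plus a paired mean vector in the synthesis domain, and it takes the enhanced output to be the posterior-weighted average of those synthesis-domain means. All function names, parameters, and the toy values are hypothetical.

```python
import numpy as np

def diag_gauss_loglik(obs, means, variances):
    """Log-likelihood of each frame under each state's diagonal Gaussian.
    obs: (T, D); means, variances: (N, D); returns (T, N)."""
    diff = obs[:, None, :] - means[None, :, :]
    return -0.5 * np.sum(np.log(2 * np.pi * variances)[None] + diff**2 / variances[None], axis=2)

def state_posteriors(loglik, trans, init):
    """Scaled forward-backward pass; returns state posteriors gamma of shape (T, N)."""
    T, N = loglik.shape
    # Per-frame rescaling avoids underflow and cancels in the final normalization.
    lik = np.exp(loglik - loglik.max(axis=1, keepdims=True))
    alpha = np.zeros((T, N))
    beta = np.ones((T, N))
    alpha[0] = init * lik[0]
    alpha[0] /= alpha[0].sum()
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ trans) * lik[t]
        alpha[t] /= alpha[t].sum()
    for t in range(T - 2, -1, -1):
        beta[t] = trans @ (lik[t + 1] * beta[t + 1])
        beta[t] /= beta[t].sum()
    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)

def twin_estimate(gamma, synth_means):
    """Assumed mapping: posterior-weighted sum of the paired synthesis-domain means."""
    return gamma @ synth_means

# Toy example (hypothetical values): 2 states, 1-D recognition features,
# 2-D synthesis-domain means paired with the same states.
rec_means = np.array([[0.0], [5.0]])
rec_vars = np.ones((2, 1))
synth_means = np.array([[1.0, 1.0], [10.0, 10.0]])
trans = np.array([[0.9, 0.1], [0.1, 0.9]])
init = np.array([0.5, 0.5])
obs = np.array([[0.0], [0.1], [5.0], [5.2]])

gamma = state_posteriors(diag_gauss_loglik(obs, rec_means, rec_vars), trans, init)
enhanced = twin_estimate(gamma, synth_means)
```

In this toy setup the state decisions are made entirely from the recognition-domain features, while the output lives in the synthesis domain; that pairing of two output distributions per state is the only aspect of the twin-HMM idea the sketch is meant to convey.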

Bibliographic reference. Abdelaziz, Ahmed Hussen / Zeiler, Steffen / Kolossa, Dorothea (2013): "Using twin-HMM-based audio-visual speech enhancement as a front-end for robust audio-visual speech recognition", in INTERSPEECH-2013, 867-871.