INTERSPEECH 2007
8th Annual Conference of the International Speech Communication Association

Antwerp, Belgium
August 27-31, 2007

Audio-Visual Phoneme Classification for Pronunciation Training Applications

Hedvig Kjellström (1), Olov Engwall (1), Sherif Mahdy Abdou (2), Olle Bälter (1)

(1) KTH, Sweden
(2) Cairo University, Egypt

We present a method for audio-visual classification of Swedish phonemes, to be used in computer-assisted pronunciation training. The probabilistic kernel-based method is applied to the audio signal, to a principal or independent component analysis (PCA or ICA) representation of the mouth region in video images, or to both. We investigate which representation (PCA or ICA) is most suitable, and how many components the basis requires, for automatically detecting pronunciation errors in Swedish from audio-visual input. Experiments on one speaker show that visual information helps avoid classification errors that would lead to gravely erroneous feedback to the user; that it is better to classify phonemes on audio and video separately and then fuse the results than to combine the two before classification; and that PCA outperforms ICA for fewer than 50 components.
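The distinction between fusing after classification and combining before classification can be illustrated with a minimal sketch. The abstract does not state which fusion rule the paper uses, so the weighted log-linear (product) rule below, the function names, the weight, and the posterior values are all assumptions chosen for illustration, not the authors' implementation.

```python
import numpy as np

def late_fusion(p_audio, p_video, audio_weight=0.5):
    """Fuse per-class posteriors from two separately trained classifiers.

    Assumption: a weighted log-linear (product) rule; the paper's actual
    fusion rule is not given in the abstract.
    """
    p_audio = np.asarray(p_audio, dtype=float)
    p_video = np.asarray(p_video, dtype=float)
    eps = 1e-12  # guards against log(0)
    log_fused = (audio_weight * np.log(p_audio + eps)
                 + (1.0 - audio_weight) * np.log(p_video + eps))
    fused = np.exp(log_fused - log_fused.max())  # shift for numerical stability
    return fused / fused.sum()                   # renormalize to sum to 1

def early_fusion(x_audio, x_video):
    """Concatenate feature vectors so that a single classifier sees both."""
    return np.concatenate([np.asarray(x_audio, dtype=float),
                           np.asarray(x_video, dtype=float)])

# Hypothetical posteriors over three phoneme classes (illustrative values only).
p_audio = [0.5, 0.4, 0.1]   # audio alone prefers class 0
p_video = [0.1, 0.7, 0.2]   # video alone prefers class 1
fused = late_fusion(p_audio, p_video)
best_class = int(np.argmax(fused))

# Hypothetical feature vectors: 4 audio features, 3 visual components.
joint_features = early_fusion(np.zeros(4), np.ones(3))
```

In this toy case the audio classifier alone would pick class 0, while the fused posterior favors class 1 because the video evidence against class 0 is strong. With early fusion, a single classifier would instead be trained on the concatenated vector.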


Bibliographic reference. Kjellström, Hedvig / Engwall, Olov / Abdou, Sherif Mahdy / Bälter, Olle (2007): "Audio-visual phoneme classification for pronunciation training applications", in INTERSPEECH-2007, 702-705.