Auditory-Visual Speech Processing 2005

British Columbia, Canada
July 24-27, 2005

Consonant Confusion Structure Based on Machine Classification of Visual Features in Continuous Speech

Jianxia Xue (1), Jintao Jiang (2), Abeer Alwan (1), Lynne E. Bernstein (2,3)

(1) Department of Electrical Engineering, University of California, Los Angeles, CA, USA
(2) Department of Communication Neuroscience, House Ear Institute, Los Angeles, CA, USA
(3) National Science Foundation, Social, Behavioral, and Economic Sciences Directorate, Arlington, VA, USA

This study is a first step toward selecting an appropriate subword-unit representation for synthesizing highly intelligible 3D talking faces. Consonant confusions were obtained from optic features in a 320-sentence database, spoken by a male talker, using Gaussian mixture models and maximum a posteriori classification. The results were compared with consonant confusions obtained from visual-only human perception tests of nonsense CV syllables spoken by the same talker. At the phoneme level, machine classification of the continuous speech database outperformed human perception of the isolated syllables. However, the machine distinguished fewer consonant clusters than the human perceivers did. For modeling optic features in continuous visual speech synthesis, the results suggest that, for most consonants, modeling at the phoneme level is more appropriate than modeling at the level of the phoneme clusters derived from the visual-only human perception tests. For some consonants, context-dependent modeling might further improve accuracy for the talker studied here.
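To make the classification pipeline concrete, the following is a minimal sketch, not the authors' implementation: it fits one Gaussian mixture model per consonant over pre-extracted, phoneme-aligned optic feature vectors, classifies frames by maximum a posteriori decision, and tallies a confusion matrix. The use of scikit-learn's GaussianMixture, the diagonal covariance and mixture size, and all function names are assumptions for illustration; the paper does not specify these details.

    # Illustrative sketch only (not the authors' code). Assumes optic features
    # are already extracted and aligned to consonant labels elsewhere.
    import numpy as np
    from sklearn.mixture import GaussianMixture

    def train_gmms(features_by_consonant, n_components=8):
        """Fit one GMM per consonant; class priors from relative frequency."""
        gmms, priors = {}, {}
        total = sum(len(x) for x in features_by_consonant.values())
        for label, x in features_by_consonant.items():
            # x: (n_frames, n_features) array of optic feature vectors
            gmms[label] = GaussianMixture(n_components=n_components,
                                          covariance_type='diag').fit(x)
            priors[label] = len(x) / total
        return gmms, priors

    def map_classify(gmms, priors, x):
        """MAP decision: argmax over log p(x | c) + log P(c)."""
        labels = list(gmms)
        scores = np.column_stack(
            [gmms[c].score_samples(x) + np.log(priors[c]) for c in labels])
        return [labels[i] for i in scores.argmax(axis=1)]

    def confusion_matrix(true_labels, predicted_labels, labels):
        """Rows: stimulus consonant; columns: classified consonant."""
        idx = {c: i for i, c in enumerate(labels)}
        m = np.zeros((len(labels), len(labels)), dtype=int)
        for t, p in zip(true_labels, predicted_labels):
            m[idx[t], idx[p]] += 1
        return m

A confusion matrix built this way can then be clustered (e.g., by grouping consonants with high mutual confusion) and compared against the cluster structure observed in the human perception tests, which is the comparison the abstract describes.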


Bibliographic reference. Xue, Jianxia / Jiang, Jintao / Alwan, Abeer / Bernstein, Lynne E. (2005): "Consonant confusion structure based on machine classification of visual features in continuous speech", In AVSP-2005, 103-108.