Auditory-Visual Speech Processing 2005
British Columbia, Canada
This study is a first step toward selecting an appropriate subword-unit representation for synthesizing highly intelligible 3D talking faces. Consonant confusions were obtained from optic features extracted from a 320-sentence database, spoken by a male talker, using Gaussian mixture models and maximum a posteriori (MAP) classification. The results were compared with consonant confusions obtained from visual-only human perception tests of nonsense CV syllables spoken by the same talker. At the phoneme level, machine classification on the continuous-speech database outperformed human perception of isolated syllables. However, the machine distinguished fewer consonant clusters than humans did. For modeling optic features in continuous visual speech synthesis, the results suggest that, for most consonants, modeling at the phoneme level is more appropriate than modeling with the phoneme clusters derived from visual-only human perception tests. For some consonants, context-dependent modeling might further improve accuracy for the talker studied in this paper.
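The classification approach named in the abstract (a Gaussian mixture model per consonant class, with a MAP decision over classes) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the class labels, feature dimensionality, mixture size, and the use of scikit-learn are all assumptions for the example.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

class MAPGMMClassifier:
    """Per-class GMMs with a maximum a posteriori decision rule.

    One GMM is fit per class; a sample is assigned to the class
    maximizing log prior + log likelihood. Illustrative sketch only.
    """

    def __init__(self, n_components=2, seed=0):
        self.n_components = n_components
        self.seed = seed

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.models_ = {}
        self.log_priors_ = {}
        for c in self.classes_:
            Xc = X[y == c]
            # Class prior estimated from relative frequency in training data.
            self.log_priors_[c] = np.log(len(Xc) / len(X))
            self.models_[c] = GaussianMixture(
                n_components=self.n_components, random_state=self.seed
            ).fit(Xc)
        return self

    def predict(self, X):
        # score_samples returns per-sample log-likelihood under each class GMM.
        scores = np.stack(
            [self.log_priors_[c] + self.models_[c].score_samples(X)
             for c in self.classes_],
            axis=1,
        )
        return self.classes_[np.argmax(scores, axis=1)]

# Usage with synthetic 2D "optic features" for two hypothetical classes:
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, (100, 2)), rng.normal(5.0, 0.5, (100, 2))])
y = np.array([0] * 100 + [1] * 100)
clf = MAPGMMClassifier(n_components=2).fit(X, y)
pred = clf.predict(X)
```

In the study itself, the confusion structure would come from tabulating `pred` against the true consonant labels on held-out data; the confusion matrix, not the accuracy alone, is what is compared to the human perceptual confusions.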
Bibliographic reference. Xue, Jianxia / Jiang, Jintao / Alwan, Abeer / Bernstein, Lynne E. (2005): "Consonant confusion structure based on machine classification of visual features in continuous speech", In AVSP-2005, 103-108.