This study examined relationships between the similarity structures of optical phonetic measures and visual phonetic perception. Four talkers who varied in visual intelligibility were recorded simultaneously with a three-dimensional optical recording system and a video camera. Subjects identified the talkers' consonant-vowel nonsense syllable utterances in a forced-choice identification task. The resulting perceptual confusion matrices were analyzed with multidimensional scaling, yielding Euclidean distances among the stimulus phonemes. Physical Euclidean distances between the same phonemes were computed on the raw three-dimensional optical recordings. Multilinear regression was then used to generate a transformation vector relating physical to perceptual distances, and correlations were computed between the transformed physical distances and the perceptual distances. These correlations ranged between .77 and .81 (59% to 66% of variance accounted for), depending on the vowel context. The results show that relatively raw representations of the physical stimuli were effective in accounting for visual speech perception, consistent with the hypothesis that perceptual representations and similarity structures for visual speech are modality-specific.
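The analysis pipeline described above (confusions, then MDS, then regression, then correlation) can be sketched in a few lines of code. The following is a minimal illustration with synthetic data, not the authors' implementation: the confusion matrix, the per-channel physical distances, and the use of scikit-learn's MDS and LinearRegression are all assumptions standing in for the study's actual optical measures and multilinear regression procedure.

```python
# Minimal sketch of the abstract's analysis pipeline (hypothetical data).
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import pearsonr
from sklearn.manifold import MDS
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n_phonemes = 10

# Hypothetical perceptual confusion matrix (rows: stimuli, columns: responses).
confusions = rng.random((n_phonemes, n_phonemes)) + 5 * np.eye(n_phonemes)

# Turn confusions into symmetric dissimilarities for multidimensional scaling.
p = confusions / confusions.sum(axis=1, keepdims=True)
dissim = 1.0 - (p + p.T) / 2.0
np.fill_diagonal(dissim, 0.0)

# MDS places phonemes in a low-dimensional perceptual space; pairwise
# Euclidean distances in that space serve as the perceptual distances.
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(dissim)
perc_dist = pdist(coords)  # condensed vector, one entry per phoneme pair

# Hypothetical physical distances: one column per optical measure
# (the study computed these on the raw 3-D optical recordings).
n_channels = 6
phys_dist = rng.random((perc_dist.size, n_channels))

# Multilinear regression yields a weight vector transforming the
# physical distance components toward the perceptual distances.
reg = LinearRegression().fit(phys_dist, perc_dist)
transformed = reg.predict(phys_dist)

# Correlate transformed physical distances with perceptual distances;
# r squared is the proportion of variance accounted for.
r, _ = pearsonr(transformed, perc_dist)
print(f"r = {r:.2f}, variance accounted for = {r**2:.0%}")
```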
Cite as: Bernstein, L.E., Jiang, J., Alwan, A., Auer Jr., E.T. (2001) Similarity structure in visual phonetic perception and optical phonetics. Proc. Auditory-Visual Speech Processing, 50-55
@inproceedings{bernstein01_avsp,
  author={Lynne E. Bernstein and Jintao Jiang and Abeer Alwan and Edward T. {Auer Jr.}},
  title={{Similarity structure in visual phonetic perception and optical phonetics}},
  year=2001,
  booktitle={Proc. Auditory-Visual Speech Processing},
  pages={50--55}
}