8th Annual Conference of the International Speech Communication Association

Antwerp, Belgium
August 27-31, 2007

Integrating Audio and Visual Cues for Speaker Friendliness in Multimodal Speech Synthesis

David House

KTH, Sweden

This paper investigates interactions between audio and visual cues to friendliness in questions in two perception experiments. In the first experiment, manually edited parametric audio-visual synthesis was used to create the stimuli. Results were consistent with earlier findings in that a late, high final focal accent peak was perceived as friendlier than an earlier, lower focal accent peak. Friendliness was also effectively signaled by visual facial parameters such as a smile, a head nod, and eyebrow raising synchronized with the final accent. Consistent additive effects were found between the audio and visual cues, both for the subjects as a group and individually, showing that subjects integrate the two modalities. The second experiment used data-driven visual synthesis in which the database was recorded by an actor instructed to portray anger and happiness. Friendliness was correlated with the happy database, but the effect was not as strong as for the parametric synthesis.

Full Paper

Bibliographic reference. House, David (2007): "Integrating audio and visual cues for speaker friendliness in multimodal speech synthesis", in INTERSPEECH-2007, 1250-1253.