11th Annual Conference of the International Speech Communication Association

Makuhari, Chiba, Japan
September 26-30, 2010

Setup for Acoustic-Visual Speech Synthesis by Concatenating Bimodal Units

Asterios Toutios, Utpala Musti, Slim Ouni, Vincent Colotte, Brigitte Wrobel-Dautcourt, Marie-Odile Berger

LORIA, France

This paper presents preliminary work on building a system able to synthesize concurrently the speech signal and a 3D animation of the speaker's face. This is done by concatenating bimodal diphone units, that is, units that comprise both acoustic and visual information. The latter is acquired using a stereovision technique. The proposed method addresses the problems of asynchrony and incoherence inherent in classic approaches to audiovisual synthesis. Unit selection is based on classic target and join costs from acoustic-only synthesis, which are augmented with a visual join cost. Preliminary results indicate the benefits of the approach, since both the synthesized speech signal and the face animation are of good quality. Planned improvements and enhancements to the system are outlined.
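
As a rough illustration of the unit selection idea described in the abstract, the following minimal sketch runs a Viterbi search over candidate bimodal units, with a join cost that sums an acoustic term and a visual term. It is not the authors' implementation: the `Unit` representation (one acoustic and one visual scalar per unit boundary), the absolute-difference distances, the weights `w_acoustic` and `w_visual`, and the function name `select_units` are all assumptions made for this example.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical unit representation: (acoustic_feature, visual_feature),
# each reduced to a single scalar at the unit boundary for illustration.
Unit = Tuple[float, float]


def select_units(
    candidates: Sequence[Sequence[Unit]],
    target_cost: Callable[[int, Unit], float],
    w_acoustic: float = 1.0,
    w_visual: float = 1.0,
) -> Tuple[List[int], float]:
    """Return (indices of chosen candidates, total cost) via Viterbi search."""

    def join_cost(prev: Unit, cur: Unit) -> float:
        # Assumed form: weighted sum of acoustic and visual mismatches.
        return (w_acoustic * abs(prev[0] - cur[0])
                + w_visual * abs(prev[1] - cur[1]))

    # best[i] = cheapest cost of any path ending in candidate i at this step.
    best = [target_cost(0, u) for u in candidates[0]]
    back: List[List[int]] = []

    for t in range(1, len(candidates)):
        new_best, pointers = [], []
        for u in candidates[t]:
            costs = [best[j] + join_cost(p, u)
                     for j, p in enumerate(candidates[t - 1])]
            j_min = min(range(len(costs)), key=costs.__getitem__)
            new_best.append(costs[j_min] + target_cost(t, u))
            pointers.append(j_min)
        best = new_best
        back.append(pointers)

    # Trace the cheapest path backwards through the stored pointers.
    end = min(range(len(best)), key=best.__getitem__)
    path = [end]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[end]


# Toy data: two target positions, two candidate units each.
cands = [[(0.0, 0.0), (1.0, 5.0)], [(0.0, 4.0), (1.0, 0.5)]]
no_target = lambda t, u: 0.0  # ignore target cost in this toy example
acoustic_only = select_units(cands, no_target, w_visual=0.0)
with_visual = select_units(cands, no_target, w_visual=1.0)
```

In this toy example, enabling the visual term changes which candidate is chosen at the second position, which is the kind of effect an added visual join cost is meant to have on the selected sequence.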

Bibliographic reference. Toutios, Asterios / Musti, Utpala / Ouni, Slim / Colotte, Vincent / Wrobel-Dautcourt, Brigitte / Berger, Marie-Odile (2010): "Setup for acoustic-visual speech synthesis by concatenating bimodal units", in INTERSPEECH-2010, 486-489.