INTERSPEECH 2008
9th Annual Conference of the International Speech Communication Association

Brisbane, Australia
September 22-26, 2008

A Real-Time Text to Audio-Visual Speech Synthesis System

Lijuan Wang (1), Xiaojun Qian (2), Lei Ma (1), Yao Qian (1), Yining Chen (1), Frank K. Soong (1)

(1) Microsoft Research Asia, China; (2) Fudan University, China

In addition to speech, visual information (e.g., facial expressions, head motions, and gestures) is an important part of human communication. It conveys, explicitly or implicitly, the intentions, emotional states, and other paralinguistic information encoded in the speech chain. In this paper we present a multi-language, real-time text-to-audiovisual speech synthesis system that automatically generates both audio and visual streams for a given text. While the audio stream is generated by our new HMM-based TTS engine, the visual stream is rendered by multiple animation channels that simultaneously control a cartoon figure parameterized in a 3D model. We investigate the challenges in synthesizing, synchronizing, and integrating multiple-channel information sources, and develop methods for generating natural, realistic animations. The result of rendering all available or learned information is an expressive audio-visual synthesis module for user-friendly human-machine communication applications.
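As a hedged illustration of the synchronization step described above, the sketch below shows one common way such a system could align a visual channel with TTS audio: take per-phone timings (as an HMM-based engine's state alignment might report them), map phones to visemes, and sample a mouth-shape channel at the video frame rate. All names (PhoneSegment, PHONE_TO_VISEME, viseme_track) and the phone-to-viseme table are hypothetical assumptions for illustration, not the authors' actual method or channel model.

# Minimal, hypothetical audio-visual synchronization sketch. Assumes the TTS
# engine exposes per-phone start/end times; the paper's actual multi-channel
# rendering scheme is not reproduced here.

from dataclasses import dataclass

# Hypothetical phone-to-viseme table; a real system would cover the full phone set.
PHONE_TO_VISEME = {
    "sil": "rest", "p": "bilabial", "b": "bilabial", "m": "bilabial",
    "f": "labiodental", "v": "labiodental", "aa": "open", "iy": "spread",
}

@dataclass
class PhoneSegment:
    phone: str
    start: float  # seconds, e.g. from the HMM state alignment
    end: float

def viseme_track(segments, fps=25.0):
    """Sample one viseme label per video frame over the utterance duration,
    so the mouth-shape channel stays time-aligned with the audio stream."""
    frames = []
    if not segments:
        return frames
    total = segments[-1].end
    t, i = 0.0, 0
    while t < total:
        # Advance to the phone segment covering time t.
        while i < len(segments) - 1 and t >= segments[i].end:
            i += 1
        frames.append(PHONE_TO_VISEME.get(segments[i].phone, "rest"))
        t += 1.0 / fps

    return frames

# Usage: phone timings as a TTS engine might report them for a short utterance.
segs = [PhoneSegment("sil", 0.0, 0.1), PhoneSegment("m", 0.1, 0.2),
        PhoneSegment("aa", 0.2, 0.45), PhoneSegment("sil", 0.45, 0.6)]
print(viseme_track(segs))  # one viseme label per 40 ms video frame

Other channels (head motion, gestures) could be sampled on the same frame clock and blended at render time, which is one simple way to keep independently generated channels synchronized.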


Bibliographic reference. Wang, Lijuan / Qian, Xiaojun / Ma, Lei / Qian, Yao / Chen, Yining / Soong, Frank K. (2008): "A real-time text to audio-visual speech synthesis system", in Proceedings of INTERSPEECH 2008, pp. 2338-2341.