In addition to speech, visual information (e.g., facial expressions, head motions, and gestures) is an important part of human communication. It conveys, explicitly or implicitly, the intentions, emotional states, and other paralinguistic information encoded in the speech chain. In this paper we present a multilingual, real-time text-to-audiovisual speech synthesis system, which automatically generates both audio and visual streams for a given text. While the audio stream is generated by our new HMM-based TTS engine, the visual stream is rendered by simultaneously combining multiple animation channels, which control a cartoon figure parameterized by a 3D model. The challenges in synthesizing, synchronizing, and integrating multi-channel information sources are investigated, and methods for generating natural, realistic animations are developed. The result of rendering all available or learned information is an expressive audio-visual synthesis module for user-friendly, human-machine communication applications.
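To illustrate the kind of multi-channel synchronization the abstract describes, the following is a minimal, hypothetical sketch of blending several animation channels on a common frame grid tied to the audio timeline. All names (`Channel`, `blend_channels`, the 30 fps frame rate, the per-channel weights) are illustrative assumptions, not details from the paper.

```python
from dataclasses import dataclass

FRAME_RATE = 30.0  # assumed video frame rate (frames per second)

@dataclass
class Channel:
    """One animation channel (e.g., lips, eyebrows) as sparse keyframes."""
    name: str
    weight: float
    keyframes: list  # sorted list of (time_sec, parameter_value) pairs

    def value_at(self, t: float) -> float:
        """Linearly interpolate the channel's parameter value at time t."""
        kf = self.keyframes
        if t <= kf[0][0]:
            return kf[0][1]
        for (t0, v0), (t1, v1) in zip(kf, kf[1:]):
            if t0 <= t <= t1:
                a = (t - t0) / (t1 - t0)
                return v0 + a * (v1 - v0)
        return kf[-1][1]

def blend_channels(channels, audio_duration):
    """Sample every channel on a shared frame grid synchronized to the
    audio duration, and blend them with a weighted average per frame."""
    n_frames = int(audio_duration * FRAME_RATE)
    total_w = sum(c.weight for c in channels)
    frames = []
    for i in range(n_frames):
        t = i / FRAME_RATE
        frames.append(sum(c.weight * c.value_at(t) for c in channels) / total_w)
    return frames
```

In practice the actual system drives many facial and gesture parameters per frame rather than one scalar, but the same idea applies: every channel is resampled onto the audio-locked frame clock before rendering.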
Bibliographic reference: Wang, Lijuan / Qian, Xiaojun / Ma, Lei / Qian, Yao / Chen, Yining / Soong, Frank K. (2008): "A real-time text to audio-visual speech synthesis system", in Proc. INTERSPEECH 2008, pp. 2338-2341.