ISCA Archive Interspeech 2008

A real-time text to audio-visual speech synthesis system

Lijuan Wang, Xiaojun Qian, Lei Ma, Yao Qian, Yining Chen, Frank K. Soong

In addition to speech, visual information (e.g., facial expressions, head motions, and gestures) is an important part of human communication. It conveys, explicitly or implicitly, the intentions, the emotional states, and other paralinguistic information encoded in the speech chain. In this paper we present a multi-language, real-time text-to-audiovisual speech synthesis system, which automatically generates both audio and visual streams for a given text. While the audio stream is generated by our new HMM-based TTS engine, the visual stream is rendered by incorporating multiple animation channels, which simultaneously control a cartoon figure parameterized in a 3D model. The challenges in synthesizing, synchronizing, and integrating multiple-channel information sources are investigated, and methods of generating natural, realistic animations are developed. The result of rendering all available or learned information is an expressive audio-visual synthesis module for user-friendly, human-machine communication applications.

doi: 10.21437/Interspeech.2008-596

Cite as: Wang, L., Qian, X., Ma, L., Qian, Y., Chen, Y., Soong, F.K. (2008) A real-time text to audio-visual speech synthesis system. Proc. Interspeech 2008, 2338-2341, doi: 10.21437/Interspeech.2008-596

@inproceedings{wang08_interspeech,
  author={Lijuan Wang and Xiaojun Qian and Lei Ma and Yao Qian and Yining Chen and Frank K. Soong},
  title={{A real-time text to audio-visual speech synthesis system}},
  year=2008,
  booktitle={Proc. Interspeech 2008},
  pages={2338--2341},
  doi={10.21437/Interspeech.2008-596}
}