Sixth European Conference on Speech Communication and Technology
(EUROSPEECH'99)

Budapest, Hungary
September 5-9, 1999

Text-to-Audio-Visual Speech Synthesis Based on Parameter Generation from HMM

Masatsune Tamura, Shigekazu Kondo, Takashi Masuko, Takao Kobayashi

Interdisciplinary Graduate School of Science and Engineering, Tokyo Institute of Technology, Yokohama, Japan

This paper describes a technique for synthesizing auditory speech and lip motion from arbitrarily given text. The technique is an extension of a visual speech synthesis technique based on an algorithm for parameter generation from HMMs with dynamic features. Audio and visual features of each speech unit are modeled by a single HMM. Since both audio and visual parameters are generated simultaneously in a unified framework, auditory speech with synchronized lip movements can be generated automatically. We trained both syllable and triphone models as speech synthesis units and compared their performance in text-to-audio-visual speech synthesis. Experimental results show that audio-visual speech generated using triphone models achieves higher quality than that generated using syllable models.
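
The core generation step referred to above is the maximum-likelihood parameter generation from HMMs with dynamic features (Tokuda et al.), applied jointly to audio and visual streams. The sketch below is a minimal, illustrative version only, assuming a single Gaussian per frame along a fixed state sequence, one stream and one parameter dimension; the function name, array layout, and delta window are assumptions, not the authors' implementation.

```python
import numpy as np

def generate_trajectory(means, variances, delta_win=(-0.5, 0.0, 0.5)):
    """ML parameter generation with dynamic (delta) features.

    means, variances: (T, 2) arrays of per-frame Gaussian means and
    variances for [static, delta] of one parameter dimension, taken
    from the HMM state sequence. Returns the smooth static trajectory
    c that maximizes the likelihood of the [static, delta] sequence.
    """
    T = means.shape[0]
    # Build W so that W @ c yields the interleaved [static, delta]
    # observation sequence implied by the static trajectory c.
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0                        # static coefficient
        for k, w in zip((-1, 0, 1), delta_win):  # delta regression window
            if 0 <= t + k < T and w != 0.0:
                W[2 * t + 1, t + k] = w
    mu = means.reshape(-1)               # interleaved [static, delta] means
    prec = 1.0 / variances.reshape(-1)   # diagonal precisions
    # Solve (W' P W) c = W' P mu, the normal equations of the ML problem.
    A = W.T @ (prec[:, None] * W)
    b = W.T @ (prec * mu)
    return np.linalg.solve(A, b)
```

Because the paper models audio and visual features in a single HMM, the same kind of solve would be carried out per stream/dimension from one shared state sequence, which is what keeps the generated speech and lip parameters synchronized.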



Bibliographic reference.  Tamura, Masatsune / Kondo, Shigekazu / Masuko, Takashi / Kobayashi, Takao (1999): "Text-to-audio-visual speech synthesis based on parameter generation from HMM", In EUROSPEECH'99, 959-962.