Auditory-Visual Speech Processing (AVSP'98)

December 4-6, 1998
Terrigal - Sydney, Australia

Visual Speech Synthesis Based on Parameter Generation from HMM: Speech-Driven and Text-And-Speech-Driven Approaches

Masatsune Tamura, Takashi Masuko, Takao Kobayashi, Keiichi Tokuda

Interdisciplinary Graduate School of Science and Engineering, Tokyo Institute of Technology, Japan

This paper describes a technique for synthesizing synchronized lip movements from an auditory input speech signal. The technique is based on an algorithm for parameter generation from HMMs with dynamic features, which has been successfully applied to text-to-speech synthesis. Audio-visual speech unit HMMs, namely syllable HMMs, are trained with parameter vector sequences that represent both auditory and visual speech features. Input speech is recognized using the syllable HMMs and converted into a transcription and a state sequence. A sentence HMM is constructed by concatenating the syllable HMMs corresponding to the transcription of the input speech. An optimum visual speech parameter sequence is then generated from the sentence HMM in the maximum likelihood (ML) sense. Since the generated parameter sequence reflects statistical information on both the static and dynamic features of several phonemes before and after the current phoneme, the synthetic lip motion becomes smooth and realistic. We show experimental results that demonstrate the effectiveness of the proposed technique.
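The abstract does not spell out the generation step, so the following is only a minimal sketch of ML parameter generation with dynamic features, not the authors' implementation. It assumes a single visual feature dimension, diagonal covariances, a state sequence that has already been expanded to per-frame means and variances, and a central-difference delta window; the function name `generate_trajectory` and all numeric values are hypothetical. Under these assumptions the ML trajectory c solves (W' P W) c = W' P mu, where W stacks the static and delta operators and P holds the inverse variances.

```python
import numpy as np

def generate_trajectory(mean_static, var_static, mean_delta, var_delta):
    """ML trajectory for one feature dimension from per-frame Gaussian statistics."""
    T = len(mean_static)
    # W maps the static trajectory c (length T) to [static; delta] (length 2T).
    # Assumed delta window: delta[t] = 0.5 * (c[t+1] - c[t-1]), clamped at the edges.
    W = np.zeros((2 * T, T))
    W[:T, :] = np.eye(T)
    for t in range(T):
        W[T + t, min(t + 1, T - 1)] += 0.5
        W[T + t, max(t - 1, 0)] -= 0.5
    mu = np.concatenate([mean_static, mean_delta])
    precision = 1.0 / np.concatenate([var_static, var_delta])
    # Normal equations of the weighted least-squares (ML) problem.
    A = W.T @ (precision[:, None] * W)
    b = W.T @ (precision * mu)
    return np.linalg.solve(A, b)

# Toy example: two states of five frames each, static means stepping from 0 to 1.
mean_static = np.array([0.0] * 5 + [1.0] * 5)
var_static = np.full(10, 0.1)
mean_delta = np.zeros(10)

# With informative delta statistics the step is smoothed across the state boundary.
smooth = generate_trajectory(mean_static, var_static, mean_delta, np.full(10, 0.01))

# With an extremely large delta variance the dynamic features carry almost no
# weight, so the result falls back to the piecewise-constant static means.
stepwise = generate_trajectory(mean_static, var_static, mean_delta, np.full(10, 1e12))
```

In this toy setting, `stepwise` mimics generation without dynamic features (a trajectory that jumps at state boundaries), while `smooth` shows how the delta constraints spread the transition over neighbouring frames.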



Bibliographic reference. Tamura, Masatsune / Masuko, Takashi / Kobayashi, Takao / Tokuda, Keiichi (1998): "Visual speech synthesis based on parameter generation from HMM: speech-driven and text-and-speech-driven approaches", In AVSP-1998, 221-226.

Multimedia Files

Link | Original Filename | Description | Format
av98_221_1.mov (537 KB) | 41_01.mov | Real lip movements | Video File: QuickTime
av98_221_2.mov (537 KB) | 41_02.mov | Synthetic lip movements using speech-driven approach with dynamic features | Video File: QuickTime
av98_221_3.mov (537 KB) | 41_03.mov | Synthetic lip movements using text-and-speech-driven approach with dynamic features | Video File: QuickTime
av98_221_4.mov (537 KB) | 41_04.mov | Synthetic lip movements using speech-driven approach without dynamic features | Video File: QuickTime
av98_221_5.mov (537 KB) | 41_05.mov | Synthetic lip movements using text-and-speech-driven approach without dynamic features | Video File: QuickTime