Sixth International Conference on Spoken Language Processing
(ICSLP 2000)

Beijing, China
October 16-20, 2000

HMM-Based Text-To-Audio-Visual Speech Synthesis

Shinji Sako (1), Keiichi Tokuda (1), Takashi Masuko (2), Takao Kobayashi (2), Tadashi Kitamura (1)

(1) Department of Computer Science, Nagoya Institute of Technology, Nagoya, Japan
(2) Interdisciplinary Graduate School of Science and Engineering, Tokyo Institute of Technology, Yokohama, Japan

This paper describes a technique for text-to-audio-visual speech synthesis based on hidden Markov models (HMMs), in which lip image sequences are modeled with an image- or pixel-based approach. To reduce the dimensionality of the visual speech feature space, we obtain a set of orthogonal vectors (eigenlips) by principal component analysis (PCA), and use a subset of the PCA coefficients and their dynamic features as visual speech parameters. Auditory and visual speech parameters are modeled by separate HMMs, and lip movements are synchronized with the auditory speech by using its phoneme boundaries when synthesizing lip image sequences. We confirmed that the generated auditory speech and lip image sequences are realistic and naturally synchronized.
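As a rough illustration of the visual front end described above, the Python sketch below flattens each lip image, computes a PCA basis (eigenlips), and stacks a truncated set of coefficients with simple delta features. The image size, number of retained components, and delta window are hypothetical; the paper's actual parameter extraction may differ.

    # Minimal sketch of the eigenlips idea: PCA over flattened lip-image frames,
    # then truncated coefficients plus their delta (dynamic) features.
    # Shapes and parameter values here are illustrative assumptions.
    import numpy as np

    def eigenlips_features(frames, n_components=16):
        """frames: (T, H, W) grayscale lip images -> (T, 2*n_components) features."""
        T = frames.shape[0]
        X = frames.reshape(T, -1).astype(np.float64)   # flatten each frame to a pixel vector
        mean = X.mean(axis=0)
        Xc = X - mean
        # PCA via SVD: rows of Vt form the orthogonal basis images ("eigenlips")
        U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
        eigenlips = Vt[:n_components]                  # (n_components, H*W)
        coeffs = Xc @ eigenlips.T                      # static visual parameters, (T, n_components)
        # Dynamic (delta) features: central-difference approximation over time
        deltas = np.zeros_like(coeffs)
        deltas[1:-1] = 0.5 * (coeffs[2:] - coeffs[:-2])
        return np.hstack([coeffs, deltas]), eigenlips, mean

    if __name__ == "__main__":
        # Synthetic example: 100 frames of 32x32 "lip images"
        rng = np.random.default_rng(0)
        frames = rng.random((100, 32, 32))
        feats, basis, mean = eigenlips_features(frames)
        print(feats.shape)   # (100, 32): 16 PCA coefficients + 16 deltas per frame

The resulting per-frame vectors (static coefficients plus deltas) are the kind of visual observation sequence that the visual HMMs would be trained on, in parallel with the auditory speech parameters.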



Bibliographic reference. Sako, Shinji / Tokuda, Keiichi / Masuko, Takashi / Kobayashi, Takao / Kitamura, Tadashi (2000): "HMM-based text-to-audio-visual speech synthesis", in ICSLP 2000, vol. 3, 25-28.