ISCA Archive ICSLP 2000

HMM-based text-to-audio-visual speech synthesis

Shinji Sako, Keiichi Tokuda, Takashi Masuko, Takao Kobayashi, Tadashi Kitamura

This paper describes a technique for text-to-audio-visual speech synthesis based on hidden Markov models (HMMs), in which lip image sequences are modeled using an image- or pixel-based approach. To reduce the dimensionality of the visual speech feature space, we obtain a set of orthogonal vectors (eigenlips) by principal component analysis (PCA), and use a subset of the PCA coefficients and their dynamic features as visual speech parameters. Auditory and visual speech parameters are modeled by separate HMMs, and lip movements are synchronized with auditory speech by using the phoneme boundaries of the auditory speech when synthesizing lip image sequences. We confirmed that the generated auditory speech and lip image sequences are realistic and naturally synchronized.
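
The visual feature extraction described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`fit_eigenlips`, `project`, `add_delta`), the image size, the number of retained components, and the delta window are all assumptions chosen for illustration, and random data stands in for real lip images. The sketch computes eigenlips by PCA (via SVD of mean-centered frames), projects each frame onto a subset of them, and appends simple dynamic features.

```python
import numpy as np

def fit_eigenlips(images, n_components):
    """PCA on flattened lip frames of shape (T, H*W).

    Returns the mean image and the top n_components principal
    directions (the "eigenlips"), one per row.
    """
    mean = images.mean(axis=0)
    centered = images - mean
    # Rows of vt are orthonormal principal directions, ordered by
    # decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:n_components]

def project(images, mean, eigenlips):
    """PCA coefficients of each frame, shape (T, n_components)."""
    return (images - mean) @ eigenlips.T

def add_delta(coeffs):
    """Append dynamic features to the static coefficients.

    Assumed delta window: 0.5 * (c[t+1] - c[t-1]), with the first and
    last frames replicated at the edges. The abstract does not specify
    the window actually used.
    """
    padded = np.pad(coeffs, ((1, 1), (0, 0)), mode="edge")
    delta = 0.5 * (padded[2:] - padded[:-2])
    return np.hstack([coeffs, delta])

# Synthetic stand-in for a sequence of 50 lip images of 16 x 12 pixels.
rng = np.random.default_rng(0)
frames = rng.random((50, 16 * 12))

mean, eigenlips = fit_eigenlips(frames, n_components=8)
coeffs = project(frames, mean, eigenlips)
features = add_delta(coeffs)  # static + dynamic visual parameters

# Approximate lip images can be rebuilt from the retained coefficients.
reconstructed = mean + coeffs @ eigenlips
```

In a system of the kind the abstract describes, vectors like the rows of `features` would serve as observations for the visual HMMs, and generated coefficients would be mapped back to images as in the last line.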

Cite as: Sako, S., Tokuda, K., Masuko, T., Kobayashi, T., Kitamura, T. (2000) HMM-based text-to-audio-visual speech synthesis. Proc. 6th International Conference on Spoken Language Processing (ICSLP 2000), vol. 3, 25-28

  author={Shinji Sako and Keiichi Tokuda and Takashi Masuko and Takao Kobayashi and Tadashi Kitamura},
  title={{HMM-based text-to-audio-visual speech synthesis}},
  booktitle={Proc. 6th International Conference on Spoken Language Processing (ICSLP 2000)},
  pages={vol. 3, 25-28}