Auditory-Visual Speech Processing (AVSP'99)

August 7-10, 1999
Santa Cruz, CA, USA

Picture My Voice: Audio to Visual Speech Synthesis Using Artificial Neural Networks

Dominic W. Massaro, Jonas Beskow, Michael M. Cohen, Christopher L. Fry, Tony Rodriguez

Perceptual Science Laboratory, University of California, Santa Cruz, CA, USA

This paper presents an initial implementation and evaluation of a system that synthesizes visual speech directly from the acoustic waveform. An artificial neural network (ANN) was trained to map the cepstral coefficients of an individual's natural speech to the control parameters of an animated synthetic talking head. We trained on two data sets; one was a set of 400 words spoken in isolation by a single speaker and the other a subset of extemporaneous speech from 10 different speakers. The system showed learning in both cases. A perceptual evaluation test indicated that the system's generalization to new words by the same speaker provides significant visible information, but significantly below that given by a text-to-speech algorithm.
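The frame-wise mapping described in the abstract can be illustrated with a minimal sketch of a feed-forward network that takes one frame of cepstral coefficients and returns a vector of face control parameters. The abstract does not state the network architecture, so the layer sizes, activation functions, and all names below are assumptions chosen only for illustration. The weights are random and untrained; in the actual system they would be learned from recorded speech.

```python
import math
import random

# All sizes are illustrative assumptions; the abstract does not specify them.
N_CEPSTRA = 13  # assumed number of cepstral coefficients per audio frame
N_HIDDEN = 8    # assumed number of hidden units
N_PARAMS = 4    # assumed number of talking-head control parameters


def init_layer(n_in, n_out, rng):
    """Return (weights, biases) filled with small random values."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
               for _ in range(n_out)]
    biases = [rng.uniform(-0.5, 0.5) for _ in range(n_out)]
    return weights, biases


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense(x, weights, biases, activation):
    """Apply one fully connected layer followed by an activation."""
    return [activation(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, biases)]


def forward(frame, hidden_layer, output_layer):
    """Map one cepstral frame to control parameters scaled to (0, 1)."""
    hidden_w, hidden_b = hidden_layer
    output_w, output_b = output_layer
    hidden = dense(frame, hidden_w, hidden_b, math.tanh)
    return dense(hidden, output_w, output_b, sigmoid)


rng = random.Random(0)
hidden_layer = init_layer(N_CEPSTRA, N_HIDDEN, rng)
output_layer = init_layer(N_HIDDEN, N_PARAMS, rng)

# A synthetic stand-in for one frame of cepstral coefficients.
frame = [rng.gauss(0.0, 1.0) for _ in range(N_CEPSTRA)]
controls = forward(frame, hidden_layer, output_layer)
print(len(controls))
```

Running the network once per audio frame would yield a trajectory of control parameters over time, which is what an animated talking head needs as input.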



Bibliographic reference. Massaro, Dominic W. / Beskow, Jonas / Cohen, Michael M. / Fry, Christopher L. / Rodriguez, Tony (1999): "Picture my voice: Audio to visual speech synthesis using artificial neural networks", in AVSP-1999, paper #23.

Multimedia Files

Each file presents five words from the experiment. For each word, the TTS-synthesized version is shown, followed by the ANN-synthesized version.

Link: av99_23_1.avi (7667 KB), av99_23_2.mov (3344 KB)
Original filename: jb1a.avi / jb1a.mov
Format: Video file, AVI / QuickTime MOV; Cinepak setting 5.0, 232x464, approx. 300 Kb/sec

Link: av99_23_3.avi (2108 KB), av99_23_4.mov (962 KB)
Original filename: jb1b.avi / jb1b.mov
Format: Video file, AVI / QuickTime MOV; Cinepak setting 5.0, 116x232, approx. 82 Kb/sec

Link: av99_23_5.mov (2042 KB), av99_23_6.mov (1345 KB)
Original filename: jb1c.mov / jb1c.mov
Format: Video file, QuickTime MOV; (_5) Cinepak setting 4.0, 232x464, approx. 202 Kb/sec; (_6) Cinepak setting 3.0, 232x464, approx. 134 Kb/sec