Auditory-Visual Speech Processing (AVSP'99)
August 7-10, 1999
This paper presents an initial implementation and evaluation of a system that synthesizes visual speech directly from the acoustic waveform. An artificial neural network (ANN) was trained to map the cepstral coefficients of an individual's natural speech to the control parameters of an animated synthetic talking head. We trained on two data sets: one consisted of 400 words spoken in isolation by a single speaker, and the other was a subset of extemporaneous speech from 10 different speakers. The system showed learning in both cases. A perceptual evaluation test indicated that the system's generalization to new words by the same speaker provides significant visible information, though significantly less than that given by a text-to-speech algorithm.
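The acoustic-to-visual mapping described in the abstract can be sketched as a small feed-forward network that takes a window of cepstral frames and outputs one frame of facial control parameters. The dimensions below (13 cepstral coefficients per frame, a 5-frame context window, 20 control parameters, 64 hidden units) are illustrative assumptions for this sketch, not the configuration reported in the paper.

```python
import numpy as np

# Illustrative dimensions (assumptions, not the paper's actual setup):
N_CEPSTRA = 13   # cepstral coefficients per acoustic frame
CONTEXT = 5      # frames of acoustic context fed to the network
N_PARAMS = 20    # control parameters of the synthetic talking head
HIDDEN = 64      # hidden units in the ANN

rng = np.random.default_rng(0)

# One-hidden-layer MLP: window of cepstral frames -> control parameters.
W1 = rng.normal(0.0, 0.1, (N_CEPSTRA * CONTEXT, HIDDEN))
b1 = np.zeros(HIDDEN)
W2 = rng.normal(0.0, 0.1, (HIDDEN, N_PARAMS))
b2 = np.zeros(N_PARAMS)

def predict(cepstral_window):
    """Map a (CONTEXT, N_CEPSTRA) block of cepstral frames to one
    frame of animation control parameters for the talking head."""
    x = cepstral_window.reshape(-1)      # flatten the context window
    h = np.tanh(x @ W1 + b1)             # hidden-layer activation
    return h @ W2 + b2                   # linear output layer

# Usage: a synthetic 5-frame cepstral window stands in for real speech.
window = rng.normal(size=(CONTEXT, N_CEPSTRA))
params = predict(window)
print(params.shape)  # one control-parameter frame: (20,)
```

At synthesis time such a network would be run frame by frame over the cepstral analysis of the input waveform, producing a control-parameter trajectory that drives the animated face.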
Bibliographic reference. Massaro, Dominic W. / Beskow, Jonas / Cohen, Michael M. / Fry, Christopher L. / Rodriguez, Tony (1999): "Picture my voice: Audio to visual speech synthesis using artificial neural networks", In AVSP-1999, paper #23.
av99_23_1.avi (7667 KB) / av99_23_2.mov (3344 KB) [jb1a.avi / jb1a.mov]
Five words from the experiment are presented. For each word, the TTS-synthesized version is shown followed by the ANN-synthesized version.
Video file: AVI / QuickTime MOV; Cinepak setting 5.0, 232x464, approx. 300 KB/sec

av99_23_3.avi (2108 KB) / av99_23_4.mov (962 KB) [jb1b.avi / jb1b.mov]
Five words from the experiment are presented. For each word, the TTS-synthesized version is shown followed by the ANN-synthesized version.
Video file: AVI / QuickTime MOV; Cinepak setting 5.0, 116x232, approx. 82 KB/sec

av99_23_5.mov (2042 KB) / av99_23_6.mov (1345 KB) [jb1c.mov / jb1c.mov]
Five words from the experiment are presented. For each word, the TTS-synthesized version is shown followed by the ANN-synthesized version.
Video file: QuickTime MOV; (_5) Cinepak setting 4.0, 232x464, approx. 202 KB/sec; (_6) Cinepak setting 3.0, 232x464, approx. 134 KB/sec