Audio-to-Visual Speech Conversion Using Deep Neural Networks

Sarah Taylor, Akihiro Kato, Iain Matthews, Ben Milner


We study the problem of mapping from acoustic to visual speech with the goal of generating accurate, perceptually natural speech animation automatically from an audio speech signal. We present a sliding window deep neural network that learns a mapping from a window of acoustic features to a window of visual features from a large audio-visual speech dataset. Overlapping visual predictions are averaged to generate continuous, smoothly varying speech animation. We outperform a baseline HMM inversion approach in both objective and subjective evaluations and perform a thorough analysis of our results.
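The abstract outlines the core inference procedure: a trained DNN maps a sliding window of acoustic features to a window of visual features, and the overlapping predictions for each frame are averaged to produce a smooth trajectory. The sketch below is a minimal illustration of that overlap-averaging step; the window lengths, feature dimensions, and the predict_window function are assumed placeholders, not the paper's actual configuration or network.

import numpy as np

def predict_visual_sequence(audio_feats, predict_window,
                            in_win=11, out_win=11, vis_dim=30):
    # Sliding-window audio-to-visual prediction with overlap averaging.
    # audio_feats:    (T, A) array of per-frame acoustic features.
    # predict_window: stand-in for the trained DNN; maps a flattened window of
    #                 acoustic features to a flattened window of visual features.
    # Returns a (T, vis_dim) array of smoothly varying visual features.
    T = audio_feats.shape[0]
    half_in, half_out = in_win // 2, out_win // 2
    # Pad the audio so every output frame sees a full acoustic context window.
    padded = np.pad(audio_feats, ((half_in, half_in), (0, 0)), mode="edge")

    vis_sum = np.zeros((T, vis_dim))
    vis_count = np.zeros((T, 1))
    for t in range(T):
        context = padded[t:t + in_win].reshape(-1)
        pred = predict_window(context).reshape(out_win, vis_dim)
        lo, hi = max(t - half_out, 0), min(t + half_out + 1, T)
        # Accumulate each frame's overlapping predictions, then average.
        vis_sum[lo:hi] += pred[lo - (t - half_out):hi - (t - half_out)]
        vis_count[lo:hi] += 1
    return vis_sum / vis_count

For a quick check of the plumbing, a linear map can stand in for the network, e.g. predict_window = lambda x: W @ x for a matrix W of shape (out_win * vis_dim, in_win * A).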


DOI: 10.21437/Interspeech.2016-483

Cite as

Taylor, S., Kato, A., Matthews, I., Milner, B. (2016) Audio-to-Visual Speech Conversion Using Deep Neural Networks. Proc. Interspeech 2016, 1482-1486.

Bibtex
@inproceedings{Taylor+2016,
  author    = {Sarah Taylor and Akihiro Kato and Iain Matthews and Ben Milner},
  title     = {Audio-to-Visual Speech Conversion Using Deep Neural Networks},
  year      = {2016},
  booktitle = {Interspeech 2016},
  pages     = {1482--1486},
  doi       = {10.21437/Interspeech.2016-483},
  url       = {http://dx.doi.org/10.21437/Interspeech.2016-483}
}