Expressive Speech Driven Talking Avatar Synthesis with DBLSTM Using Limited Amount of Emotional Bimodal Data

Xu Li, Zhiyong Wu, Helen Meng, Jia Jia, Xiaoyan Lou, Lianhong Cai


One of the essential problems in synthesizing an expressive talking avatar is how to model the interactions between emotional facial expressions and lip movements. Traditional methods either simplify such interactions by modeling lip movements and facial expressions separately, or require a substantial amount of high-quality emotional audio-visual bimodal training data, which is usually difficult to collect. This paper proposes several methods to explore different possibilities for capturing these interactions using a large-scale neutral corpus in addition to a small emotional corpus with a limited amount of data. To incorporate contextual influences, a deep bidirectional long short-term memory (DBLSTM) recurrent neural network is adopted as the regression model to predict facial features from acoustic features, emotional states, and contexts. Experimental results indicate that the method that concatenates neutral facial features with emotional acoustic features as the input of the DBLSTM model achieves the best performance in both objective and subjective evaluations.
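The best-performing method assembles the DBLSTM input by concatenating, frame by frame, neutral facial features with emotional acoustic features. A minimal sketch of that input assembly is shown below; the feature dimensions, the random data, and the source of the neutral facial features are illustrative assumptions, not values from the paper.

```python
import numpy as np

# Hypothetical dimensions -- the paper does not specify feature sizes here.
T = 100          # frames in one utterance (assumption)
D_ACOUSTIC = 39  # acoustic feature dimension, e.g. MFCC-style (assumption)
D_FACIAL = 30    # facial feature dimension (assumption)

rng = np.random.default_rng(0)

# Emotional acoustic features extracted from the input speech
# (random placeholders standing in for real features).
emotional_acoustic = rng.standard_normal((T, D_ACOUSTIC))

# Neutral facial features, e.g. predicted by a model trained on the
# large neutral corpus (an assumption about where they come from).
neutral_facial = rng.standard_normal((T, D_FACIAL))

# Frame-wise concatenation: each time step's DBLSTM input vector is
# [neutral facial features ; emotional acoustic features].
dblstm_input = np.concatenate([neutral_facial, emotional_acoustic], axis=1)

print(dblstm_input.shape)  # (100, 69)
```

The concatenated sequence would then be fed to the DBLSTM regressor, which predicts the emotional facial features for each frame.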


DOI: 10.21437/Interspeech.2016-364

Cite as:

Li, X., Wu, Z., Meng, H., Jia, J., Lou, X., Cai, L. (2016) Expressive Speech Driven Talking Avatar Synthesis with DBLSTM Using Limited Amount of Emotional Bimodal Data. Proc. Interspeech 2016, 1477-1481.

Bibtex
@inproceedings{Li+2016,
  author={Xu Li and Zhiyong Wu and Helen Meng and Jia Jia and Xiaoyan Lou and Lianhong Cai},
  title={Expressive Speech Driven Talking Avatar Synthesis with {DBLSTM} Using Limited Amount of Emotional Bimodal Data},
  booktitle={Interspeech 2016},
  year={2016},
  pages={1477--1481},
  doi={10.21437/Interspeech.2016-364},
  url={http://dx.doi.org/10.21437/Interspeech.2016-364}
}