A controllable computer-animated avatar that could serve as a natural user interface for computers is demonstrated. Driven by text and emotion input, it generates expressive speech with corresponding facial movements. To create the avatar, HMM-based text-to-speech synthesis is combined with active appearance model (AAM)-based facial animation. The novelty is the degree of control achieved over the expressiveness of both the speech and the face while keeping the controls simple. Controllability is achieved by training both the speech and facial parameters within a cluster adaptive training (CAT) framework. CAT creates a continuous, low-dimensional eigenspace of expressions, which allows creating expressions of varying intensity (including ones more intense than those in the original recordings) and combining different expressions to produce new ones. Results on an emotion-recognition task show that recognition rates given the synthetic output are comparable to those given the original videos of the speaker.
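The expression control described above can be pictured as linear operations on weight vectors in a low-dimensional eigenspace. The following is a minimal illustrative sketch, not the paper's implementation: the cluster means, dimensions, and expression weight vectors are all hypothetical stand-ins for trained CAT quantities.

```python
import numpy as np

# Illustrative stand-ins for trained CAT cluster mean parameter vectors.
# In the paper these would parameterize both speech and AAM facial features.
rng = np.random.default_rng(0)
n_clusters = 4      # number of CAT clusters (assumed)
param_dim = 10      # dimensionality of a synthesis parameter vector (assumed)
cluster_means = rng.normal(size=(n_clusters, param_dim))

def synthesize(weights, means=cluster_means):
    """Combine cluster means using an expression weight vector."""
    return np.asarray(weights) @ means

# Hypothetical points in the expression eigenspace.
neutral = np.array([1.0, 0.0, 0.0, 0.0])
happy   = np.array([0.2, 0.8, 0.0, 0.0])
sad     = np.array([0.3, 0.0, 0.7, 0.0])

def set_intensity(expr, alpha, base=neutral):
    """Scale the offset from neutral; alpha > 1 extrapolates beyond
    the intensity present in the original recordings."""
    return base + alpha * (expr - base)

more_happy = set_intensity(happy, 1.5)      # more intense than recorded
happy_sad_mix = 0.5 * happy + 0.5 * sad     # blend of two expressions
params = synthesize(more_happy)             # parameters for rendering
```

Because the eigenspace is continuous, any weighted combination of expression points is itself a valid expression, which is what enables both intensity scaling and the creation of new mixed expressions from simple controls.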
Bibliographic reference. Wan, Vincent / Anderson, Robert / Blokland, Art / Braunschweiler, Norbert / Chen, Langzhou / Kolluru, BalaKrishna / Latorre, Javier / Maia, Ranniery / Stenger, Björn / Yanagisawa, Kayoko / Stylianou, Yannis / Akamine, Masami / Gales, M. J. F. / Cipolla, Roberto (2013): "Photo-realistic expressive text to talking head synthesis", In INTERSPEECH-2013, 2667-2669.