This paper addresses talking head synthesis based on the concatenation of units that comprise both acoustic and visual information. The diphone units used to synthesize a given text string are selected by minimizing a weighted linear combination of four costs that reflect linguistic, acoustic, and visual considerations. We present initial work toward a method for automatically determining the weight applied to each cost, using a series of metrics that quantitatively assess synthesis performance.
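The selection criterion described above can be illustrated with a minimal sketch. It assumes that each candidate unit sequence already has a vector of four cost values. The function names, candidate names, cost values, and weights below are hypothetical and are not taken from the paper. A practical unit-selection system would normally search over sequences with dynamic programming rather than compare whole candidates directly.

```python
from typing import Mapping, Sequence


def weighted_cost(costs: Sequence[float], weights: Sequence[float]) -> float:
    """Return the weighted linear combination sum(w_i * c_i)."""
    if len(costs) != len(weights):
        raise ValueError("costs and weights must have the same length")
    return sum(w * c for w, c in zip(weights, costs))


def select_candidate(
    candidates: Mapping[str, Sequence[float]],
    weights: Sequence[float],
) -> str:
    """Return the name of the candidate with the lowest weighted cost."""
    return min(candidates, key=lambda name: weighted_cost(candidates[name], weights))


# Hypothetical cost vectors (four costs each) for two candidate unit sequences.
candidates = {
    "sequence_a": (0.2, 0.5, 0.1, 0.4),
    "sequence_b": (0.3, 0.1, 0.3, 0.2),
}

# Hypothetical weights: equal importance for all four costs.
weights = (1.0, 1.0, 1.0, 1.0)

best = select_candidate(candidates, weights)
print(best)
```

Changing the weights can change which candidate is selected, which is why the choice of weights matters for synthesis quality.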
Bibliographic reference. Toutios, Asterios / Musti, Utpala / Ouni, Slim / Colotte, Vincent (2011): "Weight optimization for bimodal unit-selection talking head synthesis", In INTERSPEECH-2011, 2249-2252.