Auditory-Visual Speech Processing (AVSP) 2011

Volterra, Italy
September 1-2, 2011

Introducing Visual Target Cost within an Acoustic-Visual Unit-Selection Speech Synthesizer

Utpala Musti, Vincent Colotte, Asterios Toutios, Slim Ouni

Nancy Université - LORIA, Nancy, France

In this paper, we present a method to take visual information into account during the selection process of an acoustic-visual speech synthesizer. The synthesizer is based on the selection and concatenation of synchronous bimodal diphone units, i.e., the acoustic speech signal together with the 3D facial movements of the speaker's face. The visual speech information is acquired using a stereovision technique. Unit selection is based on the classical target cost consisting of linguistic and phonological features. We compare several methods for incorporating the visual articulatory context into the target cost. We present an objective evaluation of the synthesis results based on the correlation between the actual and the synthesized visual speech trajectories.
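
The abstract describes a target cost that combines linguistic/phonological features with a visual articulatory term, and an objective evaluation based on trajectory correlation. The following minimal Python sketch illustrates how such a combined cost and correlation measure could be computed; the feature names, weights, and function signatures are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def linguistic_target_cost(target_feats, candidate_feats, weights):
    # Weighted sum of mismatches over categorical linguistic/phonological
    # features (hypothetical feature set and weights).
    return sum(w for f, w in weights.items()
               if target_feats.get(f) != candidate_feats.get(f))

def visual_target_cost(target_visual, candidate_visual):
    # Euclidean distance between visual articulatory context vectors,
    # e.g. a low-dimensional summary of the desired facial configuration.
    return float(np.linalg.norm(np.asarray(target_visual) - np.asarray(candidate_visual)))

def combined_target_cost(target, candidate, weights, visual_weight=1.0):
    # Combined target cost: classical linguistic cost plus a weighted
    # visual term (the relative weighting here is purely illustrative).
    return (linguistic_target_cost(target["feats"], candidate["feats"], weights)
            + visual_weight * visual_target_cost(target["visual"], candidate["visual"]))

def trajectory_correlation(actual, synthesized):
    # Objective evaluation: Pearson correlation between actual and
    # synthesized visual trajectories, computed per dimension and averaged.
    actual = np.asarray(actual)
    synthesized = np.asarray(synthesized)
    corrs = [np.corrcoef(actual[:, d], synthesized[:, d])[0, 1]
             for d in range(actual.shape[1])]
    return float(np.mean(corrs))
```

This is only a sketch of the general technique (unit-selection target cost with an added visual component and correlation-based evaluation) under stated assumptions; the paper's actual features, weighting scheme, and evaluation protocol are given in the full text.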

Index Terms: speech synthesis, unit selection, target costs.


Bibliographic reference. Musti, Utpala / Colotte, Vincent / Toutios, Asterios / Ouni, Slim (2011): "Introducing visual target cost within an acoustic-visual unit-selection speech synthesizer", In AVSP-2011, 49-55.