Auditory-Visual Speech Processing (AVSP) 2013

Annecy, France
August 29 - September 1, 2013

Audio-Visual Speaker Conversion using Prosody Features

Adela Barbulescu (1,2), Thomas Hueber (1), Gérard Bailly (1), Remi Ronfard (2)

(1) GIPSA-Lab, CNRS & Université de Grenoble, St Martin d’Hères, France
(2) IMAGINE team, INRIA / LJK, Grenoble, France

This paper presents a joint audio-visual approach to speaker identity conversion, based on statistical methods originally introduced for voice conversion. Using experimental data from the 3D BIWI Audiovisual Corpus of Affective Communication, mapping functions are built between each pair of speakers to convert speaker-specific features: the speech signal and 3D facial expressions. The results obtained by combining audio and visual features are compared with those of earlier approaches, highlighting the improvements brought by introducing dynamic features and exploiting prosodic features.
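As a rough illustration of the GMM-based mapping the abstract refers to, the sketch below fits a joint Gaussian mixture model on time-aligned source/target feature frames and converts source frames with the conditional expectation E[y | x]. This is a minimal sketch of the general statistical voice-conversion technique, not the authors' implementation: the function names, component count, and the assumption of pre-aligned frames are illustrative, and the paper's dynamic and prosodic features are omitted.

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture


def fit_joint_gmm(src, tgt, n_components=8, seed=0):
    """Fit a GMM on time-aligned joint [source; target] frames (both T x D)."""
    joint = np.hstack([src, tgt])                        # T x 2D
    return GaussianMixture(n_components=n_components,
                           covariance_type="full",
                           random_state=seed).fit(joint)


def convert(gmm, src):
    """Map source frames to the target space via E[y | x] under the joint GMM."""
    T, D = src.shape
    mx, my = gmm.means_[:, :D], gmm.means_[:, D:]        # K x D each
    cxx = gmm.covariances_[:, :D, :D]                    # K x D x D
    cyx = gmm.covariances_[:, D:, :D]                    # K x D x D
    K = gmm.n_components

    # Posterior responsibilities p(k | x) from the marginal source-side GMM.
    logp = np.stack([np.log(gmm.weights_[k])
                     + multivariate_normal.logpdf(src, mean=mx[k], cov=cxx[k])
                     for k in range(K)], axis=1)         # T x K
    resp = np.exp(logp - logp.max(axis=1, keepdims=True))
    resp /= resp.sum(axis=1, keepdims=True)

    # Weighted sum of component-wise conditional means E[y | x, k].
    out = np.zeros((T, D))
    for k in range(K):
        A = cyx[k] @ np.linalg.inv(cxx[k])               # per-component regression matrix
        out += resp[:, [k]] * (my[k] + (src - mx[k]) @ A.T)
    return out
```

In an audio-visual setting, the same mapping can be applied to stacked acoustic and facial feature vectors; the alignment step (e.g. dynamic time warping between the two speakers' utterances) is assumed to have been done beforehand.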

Index Terms: speaker identity conversion, Gaussian mixture model, dynamic features, prosodic features


Bibliographic reference.  Barbulescu, Adela / Hueber, Thomas / Bailly, Gérard / Ronfard, Remi (2013): "Audio-visual speaker conversion using prosody features", In AVSP-2013, 11-16.