ISCA Archive eINTERFACE 2006

Multimodal speaker conversion - his master's voice ... and face

Thierry Dutoit, A. Holzapfel, M. Jottrand, F. Marqués, A. Moinet, F. Ofli, J. Pérez, Yannis Stylianou

The goal of this project is twofold: to convert a given speaker's speech (the Source speaker) into another identified voice (the Target speaker), and to analyse the Source speaker's facial movements in order to animate a 3D avatar that imitates them. We assume that a large amount of speech samples is available for both the source and target voices, with a reasonable amount of parallel data. Speech and video are processed separately and recombined at the end.

Voice conversion is performed in two steps: a voice mapping step followed by a speech synthesis step. In the synthesis step, we propose to select speech frames directly from the large target speech corpus, in a way that recalls the unit-selection principle used in state-of-the-art text-to-speech systems.
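To make the unit-selection analogy concrete, the sketch below shows one common way such a frame search can be formulated: a Viterbi-style search over target-corpus frames that trades off a target cost (distance between a mapped source frame and a candidate target frame) against a concatenation cost (distance between successively selected target frames). This is an illustrative sketch only, not the project's actual implementation; the feature vectors, the Euclidean distance, and the `concat_weight` parameter are all assumptions made for the example.

```python
import math

def dist(a, b):
    """Euclidean distance between two feature vectors (illustrative metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def select_frames(mapped_source, target_corpus, concat_weight=0.5):
    """Viterbi search: for each (already voice-mapped) source frame, pick a
    target-corpus frame minimizing cumulative target cost plus a weighted
    concatenation cost between successive selections."""
    n, m = len(mapped_source), len(target_corpus)
    # cost[t][j]: best cumulative cost with target frame j chosen at time t
    cost = [[dist(mapped_source[0], target_corpus[j]) for j in range(m)]]
    back = []
    for t in range(1, n):
        row, brow = [], []
        for j in range(m):
            target_cost = dist(mapped_source[t], target_corpus[j])
            # best predecessor given the concatenation penalty
            best_prev = min(
                range(m),
                key=lambda i: cost[-1][i]
                + concat_weight * dist(target_corpus[i], target_corpus[j]),
            )
            row.append(
                cost[-1][best_prev]
                + concat_weight * dist(target_corpus[best_prev], target_corpus[j])
                + target_cost
            )
            brow.append(best_prev)
        cost.append(row)
        back.append(brow)
    # backtrack the optimal frame sequence
    j = min(range(m), key=lambda k: cost[-1][k])
    path = [j]
    for brow in reversed(back):
        j = brow[j]
        path.append(j)
    return path[::-1]
```

For a toy corpus of one-dimensional "frames", `select_frames([[0.0], [1.0], [2.0]], [[0.1], [0.9], [2.1], [5.0]])` returns the index sequence `[0, 1, 2]`, since those target frames are closest to the source trajectory while remaining smooth. In a real system the search would run over spectral feature vectors (e.g. LPC-derived parameters), and the concatenation weight would control the smoothness/accuracy trade-off.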

The output of this four-week project can be summarized as: a tailored source database, a set of open-source MATLAB and C files, and audio and video files obtained by our conversion method. Experimental results show that our LPC synthesis method cannot reach the target voice; further work is required to enhance the quality of the speech.

Index Terms: voice conversion, speech-to-speech conversion, speaker mapping, face tracking, cloning, morphing, avatar control.

Cite as: Dutoit, T., Holzapfel, A., Jottrand, M., Marqués, F., Moinet, A., Ofli, F., Pérez, J., Stylianou, Y. (2006) Multimodal speaker conversion - his master's voice ... and face. Proc. Summer Workshop on Multimodal Interfaces (eINTERFACE 2006), 34-45

@inproceedings{dutoit2006_multimodal,
  author={Thierry Dutoit and A. Holzapfel and M. Jottrand and F. Marqués and A. Moinet and F. Ofli and J. Pérez and Yannis Stylianou},
  title={{Multimodal speaker conversion - his master's voice ... and face}},
  booktitle={Proc. Summer Workshop on Multimodal Interfaces (eINTERFACE 2006)},
  year={2006},
  pages={34--45}
}