15th Annual Conference of the International Speech Communication Association

September 14-18, 2014

Multimodal Exemplar-Based Voice Conversion Using Lip Features in Noisy Environments

Kenta Masaka, Ryo Aihara, Tetsuya Takiguchi, Yasuo Ariki

Kobe University, Japan

This paper presents a multimodal voice conversion (VC) method for noisy environments. In our previous exemplar-based VC method, source and target exemplars are extracted from parallel training data, in which the same texts are uttered by the source and target speakers. The input source signal is then decomposed into source exemplars, noise exemplars obtained from the input signal, and their weights. The converted speech is then constructed from the target exemplars and the weights related to the source exemplars. In this paper, we propose a multimodal VC method that improves the noise robustness of our previous exemplar-based VC method. As visual features, we use not only conventional DCT features but also features extracted from an Active Appearance Model (AAM) applied to the lip area of a face image. Furthermore, we introduce a combination weight between audio and visual features and formulate a new cost function in order to estimate the audio-visual exemplars. By using the joint audio-visual features as source features, the VC performance is improved compared to our previous audio-input exemplar-based VC method. The effectiveness of the proposed method was confirmed by comparison with a conventional Gaussian Mixture Model (GMM)-based method.
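As a rough illustration of the exemplar-based conversion the abstract describes (a minimal sketch, not the authors' implementation), the input can be decomposed over a fixed source exemplar dictionary via NMF-style multiplicative updates, and the converted signal reconstructed by applying the estimated weights to the parallel target exemplars. All dictionaries and the input below are hypothetical random placeholders:

```python
import numpy as np

def estimate_activations(X, D, n_iter=200, eps=1e-10):
    """Estimate non-negative activation weights H so that X ~= D @ H,
    keeping the exemplar dictionary D fixed (multiplicative updates
    minimizing KL divergence, as is common in exemplar-based VC)."""
    H = np.random.default_rng(1).random((D.shape[1], X.shape[1]))
    for _ in range(n_iter):
        V = D @ H + eps                      # current approximation
        H *= (D.T @ (X / V)) / (D.sum(axis=0)[:, None] + eps)
    return H

# Hypothetical parallel dictionaries: column k of A_src and A_tgt are
# exemplars of the same frame from the source and target speakers.
rng = np.random.default_rng(0)
A_src = rng.random((20, 50))        # source-speaker exemplars
A_tgt = rng.random((20, 50))        # target-speaker exemplars
X = A_src @ rng.random((50, 30))    # toy input spectra (source speaker)

H = estimate_activations(X, A_src)  # weights over source exemplars
Y = A_tgt @ H                       # converted spectra from target exemplars
```

In the multimodal variant, the rows of `X` and `A_src` would stack audio and lip (DCT/AAM) features with a combination weight; the activations `H` are still shared, so the conversion step `A_tgt @ H` is unchanged.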


Bibliographic reference.  Masaka, Kenta / Aihara, Ryo / Takiguchi, Tetsuya / Ariki, Yasuo (2014): "Multimodal exemplar-based voice conversion using lip features in noisy environments", In INTERSPEECH-2014, 1159-1163.