Audio-visual Voice Conversion Using Deep Canonical Correlation Analysis for Deep Bottleneck Features

Satoshi Tamura, Kento Horio, Hajime Endo, Satoru Hayamizu, Tomoki Toda


This paper proposes Audio-Visual Voice Conversion (AVVC) methods using Deep BottleNeck Features (DBNFs) and Deep Canonical Correlation Analysis (DCCA). DBNFs have been adopted in several speech applications to obtain better feature representations. DCCA can generate highly correlated features from two views and enhance the features of one modality based on the other. In addition, DCCA can, ideally, project different views into the same vector space. In this work, we first enhance our conventional AVVC scheme by employing the DBNF technique in the visual modality. Second, we apply DCCA to the DBNFs to obtain new, effective visual features. Third, we build a cross-modal voice conversion model that accepts both audio and visual DCCA features. To clarify the effectiveness of these frameworks, we carried out subjective and objective evaluations and compared the proposed methods with conventional ones. Experimental results show that our DBNF- and DCCA-based AVVC successfully improves the quality of converted speech waveforms.
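To make the DCCA idea concrete, the sketch below implements classical *linear* CCA with NumPy: given two centered views, it whitens each view's covariance and takes the SVD of the whitened cross-covariance, whose singular values are the canonical correlations. DCCA, as used in the paper, trains deep networks on each view (here, audio and visual DBNFs) to maximize exactly this correlation objective on the network outputs; the toy data, dimensions, and `linear_cca` helper below are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

def linear_cca(X, Y, k, reg=1e-4):
    """Linear CCA: find k projection pairs (A, B) maximizing the
    correlation between X @ A and Y @ B. DCCA replaces the identity
    feature maps here with deep networks trained on this objective."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    # Regularized within-view and cross-view covariances.
    Sxx = X.T @ X / (n - 1) + reg * np.eye(X.shape[1])
    Syy = Y.T @ Y / (n - 1) + reg * np.eye(Y.shape[1])
    Sxy = X.T @ Y / (n - 1)

    def inv_sqrt(S):
        # Inverse matrix square root via eigendecomposition.
        w, V = np.linalg.eigh(S)
        return V @ np.diag(w ** -0.5) @ V.T

    # Whiten each view; the SVD of the whitened cross-covariance
    # yields the canonical directions and correlations.
    T = inv_sqrt(Sxx) @ Sxy @ inv_sqrt(Syy)
    U, s, Vt = np.linalg.svd(T)
    A = inv_sqrt(Sxx) @ U[:, :k]   # projection for view 1 (e.g. audio DBNF)
    B = inv_sqrt(Syy) @ Vt[:k].T   # projection for view 2 (e.g. visual DBNF)
    return A, B, s[:k]             # s[:k] = canonical correlations

# Two toy "views" that share a 2-D latent signal plus noise.
rng = np.random.default_rng(0)
z = rng.normal(size=(500, 2))
X = z @ rng.normal(size=(2, 6)) + 0.1 * rng.normal(size=(500, 6))
Y = z @ rng.normal(size=(2, 4)) + 0.1 * rng.normal(size=(500, 4))
A, B, corrs = linear_cca(X, Y, k=2)
```

Because the two toy views share a strong common latent signal, the leading canonical correlations come out close to 1; projecting each view through `A` and `B` maps them into a shared space, which is the property the paper exploits for cross-modal conversion.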


DOI: 10.21437/Interspeech.2018-2286

Cite as: Tamura, S., Horio, K., Endo, H., Hayamizu, S., Toda, T. (2018) Audio-visual Voice Conversion Using Deep Canonical Correlation Analysis for Deep Bottleneck Features. Proc. Interspeech 2018, 2469-2473, DOI: 10.21437/Interspeech.2018-2286.


@inproceedings{Tamura2018,
  author={Satoshi Tamura and Kento Horio and Hajime Endo and Satoru Hayamizu and Tomoki Toda},
  title={Audio-visual Voice Conversion Using Deep Canonical Correlation Analysis for Deep Bottleneck Features},
  year=2018,
  booktitle={Proc. Interspeech 2018},
  pages={2469--2473},
  doi={10.21437/Interspeech.2018-2286},
  url={http://dx.doi.org/10.21437/Interspeech.2018-2286}
}