ISCA Archive Interspeech 2016

A Voice Conversion Mapping Function Based on a Stacked Joint-Autoencoder

Seyed Hamidreza Mohammadi, Alexander Kain

In this study, we propose a novel method for training a regression function and apply it to a voice conversion task. The regression function is constructed using a Stacked Joint-Autoencoder (SJAE). Previously, we used a more primitive version of this architecture for pre-training a Deep Neural Network (DNN). Using objective evaluation criteria, we show that the lower levels of the SJAE perform best with a low degree of jointness, and the higher levels with a higher degree of jointness. We demonstrate that our proposed approach generates features that do not suffer from the averaging effect inherent in back-propagation training. We also carried out subjective listening experiments to evaluate speech quality and speaker similarity. Our results show that the SJAE approach achieves both higher quality and higher similarity than an SJAE+DNN approach, in which the SJAE is used for pre-training a DNN and the fine-tuned DNN is then used for mapping. We also present the system description and results of our submission to the Voice Conversion Challenge 2016.
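The core idea of a joint autoencoder can be illustrated with a minimal sketch: two autoencoders, one per speaker, trained with an extra penalty that ties their latent codes together, where the penalty weight plays the role of the "degree of jointness". The sketch below is a deliberately simplified stand-in for the paper's method — single-layer linear autoencoders with tied weights, synthetic stand-in data, and numeric gradients — not the stacked, nonlinear architecture the authors actually train. All names (`joint_ae_loss`, `train`, `convert`) and the toy data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "parallel" data standing in for time-aligned source/target speech frames
# (hypothetical; the paper uses real spectral features from parallel corpora).
X = rng.normal(size=(64, 6))                                   # source frames
Y = 0.5 * (X @ rng.normal(size=(6, 6))) + 0.1 * rng.normal(size=(64, 6))

D, H = 6, 3  # feature dimension, latent code dimension

def joint_ae_loss(params, lam):
    """Two tied-weight linear autoencoders plus a jointness penalty:
    lam scales ||code_x - code_y||^2 (lam is the 'degree of jointness')."""
    Wx, Wy = params
    Zx, Zy = X @ Wx, Y @ Wy             # linear encoders
    Xh, Yh = Zx @ Wx.T, Zy @ Wy.T       # tied-weight linear decoders
    rec = np.mean((X - Xh) ** 2) + np.mean((Y - Yh) ** 2)
    return rec + lam * np.mean((Zx - Zy) ** 2)

def train(lam, steps=150, lr=0.2, eps=1e-5):
    """Gradient descent with central-difference numeric gradients
    (slow but simple and clearly correct for this tiny problem)."""
    params = [rng.normal(scale=0.1, size=(D, H)) for _ in range(2)]
    losses = [joint_ae_loss(params, lam)]
    for _ in range(steps):
        for W in params:
            g = np.zeros_like(W)
            for i in range(D):
                for j in range(H):
                    W[i, j] += eps
                    up = joint_ae_loss(params, lam)
                    W[i, j] -= 2 * eps
                    g[i, j] = (up - joint_ae_loss(params, lam)) / (2 * eps)
                    W[i, j] += eps          # restore the entry
            W -= lr * g
        losses.append(joint_ae_loss(params, lam))
    return params, losses

def convert(params, x):
    """SJAE-style mapping: encode a source frame with the source encoder,
    then decode its (shared) code with the target decoder."""
    Wx, Wy = params
    return (x @ Wx) @ Wy.T
```

Because the two codes are driven toward each other during training, a source frame's code can be decoded by the target decoder directly; this is what lets the autoencoder itself serve as the mapping function, rather than only as an initializer for a back-propagation-trained DNN.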


doi: 10.21437/Interspeech.2016-1437

Cite as: Mohammadi, S.H., Kain, A. (2016) A Voice Conversion Mapping Function Based on a Stacked Joint-Autoencoder. Proc. Interspeech 2016, 1647-1651, doi: 10.21437/Interspeech.2016-1437

@inproceedings{mohammadi16_interspeech,
  author={Seyed Hamidreza Mohammadi and Alexander Kain},
  title={{A Voice Conversion Mapping Function Based on a Stacked Joint-Autoencoder}},
  year=2016,
  booktitle={Proc. Interspeech 2016},
  pages={1647--1651},
  doi={10.21437/Interspeech.2016-1437}
}