Bidirectional Voice Conversion Based on Joint Training Using Gaussian-Gaussian Deep Relational Model

Kentaro Sone, Shinji Takaki, Toru Nakashika


Statistical approaches to voice conversion based on Gaussian mixture models (GMMs) have been investigated over the last decade. These approaches model the joint distribution of source and target speakers' utterances using GMMs. However, because GMMs lack sufficient representational capability, they have been replaced by deep neural networks (DNNs). DNN-based approaches represent feedforward dependencies from source utterances to target utterances. Owing to the high representational capability of DNNs, these approaches improved the quality of converted speech. However, unlike GMM-based approaches, DNN-based approaches cannot convert target utterances back into source utterances, so training models for both conversion directions costs twice as much as with GMM-based approaches. To classify and generate binary-valued images, a deep relational model (DRM) has been proposed. A DRM consists of two visible layers and multiple hidden layers, like a DNN, and can classify and generate images by modeling a bidirectional relationship between images and labels. In this paper, we define a Gaussian-Gaussian DRM (GGDRM), a Gaussian-Gaussian form of the traditional DRM, and propose a method to apply a GGDRM to voice conversion. Experimental results show that our GGDRM-based method outperforms GMM- and DNN-based methods.
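The GMM-based joint-density approach the abstract contrasts against can be sketched as follows: fit a GMM on stacked source-target feature vectors, then convert a source frame by taking the conditional expectation of the target block given the source block. This is a minimal illustration with toy 2-D features and scikit-learn, not the authors' implementation; all variable names and the toy linear mapping are assumptions for the example.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Toy parallel data: target frames are a noisy linear map of source frames.
rng = np.random.default_rng(0)
src = rng.normal(size=(500, 2))                       # source features (frames x dims)
tgt = src @ np.array([[1.5, 0.2], [0.1, 0.8]]) + 0.05 * rng.normal(size=(500, 2))
joint = np.hstack([src, tgt])                         # joint vectors z = [x; y]

# Model the joint distribution p(x, y) with a GMM.
gmm = GaussianMixture(n_components=4, covariance_type="full", random_state=0)
gmm.fit(joint)

d = src.shape[1]

def convert(x):
    """MMSE conversion: E[y | x] under the joint GMM."""
    mu_x = gmm.means_[:, :d]                          # source-block means
    mu_y = gmm.means_[:, d:]                          # target-block means
    Sxx = gmm.covariances_[:, :d, :d]                 # source-source covariances
    Syx = gmm.covariances_[:, d:, :d]                 # target-source covariances
    log_w = np.log(gmm.weights_)
    # Component responsibilities p(m | x) from the marginal over the source block.
    logp = np.array([
        log_w[m]
        - 0.5 * (np.log(np.linalg.det(2 * np.pi * Sxx[m]))
                 + (x - mu_x[m]) @ np.linalg.solve(Sxx[m], x - mu_x[m]))
        for m in range(gmm.n_components)
    ])
    resp = np.exp(logp - logp.max())
    resp /= resp.sum()
    # Weighted sum of per-component conditional means E[y | x, m].
    return sum(resp[m] * (mu_y[m] + Syx[m] @ np.linalg.solve(Sxx[m], x - mu_x[m]))
               for m in range(gmm.n_components))

y_hat = convert(src[0])
```

Because the joint model is symmetric in x and y, swapping the two blocks gives the reverse (target-to-source) conversion with no extra training, which is the bidirectionality the abstract says DNN-based approaches lose.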


 DOI: 10.21437/Odyssey.2018-37

Cite as: Sone, K., Takaki, S., Nakashika, T. (2018) Bidirectional Voice Conversion Based on Joint Training Using Gaussian-Gaussian Deep Relational Model. Proc. Odyssey 2018 The Speaker and Language Recognition Workshop, 261-266, DOI: 10.21437/Odyssey.2018-37.


@inproceedings{Sone2018,
  author={Kentaro Sone and Shinji Takaki and Toru Nakashika},
  title={Bidirectional Voice Conversion Based on Joint Training Using Gaussian-Gaussian Deep Relational Model},
  year=2018,
  booktitle={Proc. Odyssey 2018 The Speaker and Language Recognition Workshop},
  pages={261--266},
  doi={10.21437/Odyssey.2018-37},
  url={http://dx.doi.org/10.21437/Odyssey.2018-37}
}