Voice conversion based on full-covariance mixture density networks for time-variant linear transformations

Gaku Kotani, Daisuke Saito


This paper integrates a neural-network-based density estimation scheme into voice conversion (VC) under the constraint of time-variant linear transformations. In VC, deep neural networks (DNNs) are commonly used as conversion models that map source features to target features by applying a stack of nonlinear transformations to the source. In automatic speech recognition and text-to-speech synthesis, direct DNN mapping between source and target features works effectively and flexibly, since DNNs are well suited to tasks whose input and output domains are heterogeneous, i.e., speech-to-text or text-to-speech. The case of VC is different: input and output features usually lie in the same domain, such as cepstral space, and this condition can be exploited for more effective and flexible DNN-based VC. From this viewpoint, DNN-based VC with time-variant linear transformations has been proposed, in which a trained model outputs the parameters of a linear transformation for each time index t: a transformation matrix At and a bias vector bt. This method was observed to improve VC performance; however, the detailed properties of At and bt have remained obscure. In this paper, to reveal them, full-covariance mixture density networks are introduced into the VC framework. In the proposed method, the joint density of source and target features is directly estimated from the source features by mixture density networks. Owing to the tight relationship between Gaussian distributions and linear transformations, the correspondence between the parameters At and bt and the density of the feature space becomes clear. The proposed scheme was evaluated in VC experiments; the results showed improved naturalness compared with naive DNN-based VC, and the derived correspondence between At and bt and the feature density was observed.
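The "tight relationship between Gaussian and linear transformation" that the abstract refers to is the standard identity that, for jointly Gaussian source and target features (x, y), the conditional mean E[y|x] is a linear function of x: E[y|x] = Σ_yx Σ_xx⁻¹ (x − μ_x) + μ_y = A x + b. This is why a full-covariance Gaussian over the joint feature space directly yields a transformation matrix A and bias vector b of the form used in the paper. A minimal NumPy sketch (the dimensions and parameter values are illustrative, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 2-D source (x) and 2-D target (y) features with a
# full (non-diagonal) joint covariance, as a full-covariance Gaussian
# component would model them.
mu = np.array([1.0, -0.5, 2.0, 0.5])      # stacked mean [mu_x; mu_y]
L = rng.standard_normal((4, 4))
Sigma = L @ L.T + 4.0 * np.eye(4)         # full joint covariance, SPD

Sigma_xx = Sigma[:2, :2]
Sigma_yx = Sigma[2:, :2]
mu_x, mu_y = mu[:2], mu[2:]

# Conditional mean of a joint Gaussian is linear in x:
#   E[y|x] = mu_y + Sigma_yx Sigma_xx^{-1} (x - mu_x) = A x + b
A = Sigma_yx @ np.linalg.inv(Sigma_xx)
b = mu_y - A @ mu_x

# Sanity check: a least-squares regression of y on x over samples drawn
# from the joint Gaussian recovers the same A and b.
samples = rng.multivariate_normal(mu, Sigma, size=200_000)
x, y = samples[:, :2], samples[:, 2:]
X = np.hstack([x, np.ones((len(x), 1))])
W, *_ = np.linalg.lstsq(X, y, rcond=None)
A_hat, b_hat = W[:2].T, W[2]

print(np.allclose(A, A_hat, atol=0.05), np.allclose(b, b_hat, atol=0.05))
```

In the proposed method, a mixture density network predicts such full-covariance joint Gaussian parameters per frame, so the per-frame (At, bt) of the earlier time-variant linear-transformation VC can be read off from the estimated density in exactly this way.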


 DOI: 10.21437/SSW.2019-14

Cite as: Kotani, G., Saito, D. (2019) Voice conversion based on full-covariance mixture density networks for time-variant linear transformations. Proc. 10th ISCA Speech Synthesis Workshop, 75-80, DOI: 10.21437/SSW.2019-14.


@inproceedings{Kotani2019,
  author={Gaku Kotani and Daisuke Saito},
  title={{Voice conversion based on full-covariance mixture density networks for time-variant linear transformations}},
  year=2019,
  booktitle={Proc. 10th ISCA Speech Synthesis Workshop},
  pages={75--80},
  doi={10.21437/SSW.2019-14},
  url={http://dx.doi.org/10.21437/SSW.2019-14}
}