Voice Conversion Based on Trajectory Model Training of Neural Networks Considering Global Variance

Naoki Hosaka, Kei Hashimoto, Keiichiro Oura, Yoshihiko Nankaku, Keiichi Tokuda


This paper proposes a new training method for deep neural networks (DNNs) in statistical voice conversion. DNNs are now used as conversion models that represent the mapping from source features to target features. However, conventional DNN-based voice conversion has two major problems: 1) the inconsistency between the training and synthesis criteria, and 2) the over-smoothing of the generated parameter trajectories. In this paper, we introduce a parameter trajectory generation process considering the global variance (GV) into the training of DNNs for voice conversion. A consistent framework using the same criterion for both training and synthesis provides better conversion accuracy in the original static feature domain, and over-smoothing can be avoided by optimizing the DNN parameters on the basis of the trajectory likelihood considering the GV. Experimental results show that the proposed method outperforms the conventional DNN-based method in terms of both speech quality and speaker similarity.
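As a rough illustration of why over-smoothing and the GV are linked, the sketch below computes a per-dimension variance over all frames of one utterance, which is the commonly used definition of GV and is assumed here rather than taken from the paper. The function names, the toy sine trajectory, and the moving-average smoother are all hypothetical stand-ins; this is not the training method proposed in the paper, only a demonstration that a smoothed trajectory has a smaller GV than the original.

```python
import math


def global_variance(trajectory):
    """Per-dimension variance over all frames of one utterance (assumed GV definition)."""
    num_frames = len(trajectory)
    dim = len(trajectory[0])
    means = [sum(frame[d] for frame in trajectory) / num_frames for d in range(dim)]
    return [
        sum((frame[d] - means[d]) ** 2 for frame in trajectory) / num_frames
        for d in range(dim)
    ]


def moving_average(trajectory, half_width=2):
    """Crude stand-in for over-smoothing: average each frame with its neighbors."""
    num_frames = len(trajectory)
    dim = len(trajectory[0])
    smoothed = []
    for t in range(num_frames):
        lo = max(0, t - half_width)
        hi = min(num_frames, t + half_width + 1)
        window = trajectory[lo:hi]
        smoothed.append([sum(f[d] for f in window) / len(window) for d in range(dim)])
    return smoothed


# Toy one-dimensional feature trajectory (hypothetical data, not speech features).
natural = [[math.sin(0.9 * t)] for t in range(100)]
smoothed = moving_average(natural)

gv_natural = global_variance(natural)
gv_smoothed = global_variance(smoothed)

print("GV of original trajectory:", gv_natural[0])
print("GV of smoothed trajectory:", gv_smoothed[0])
```

The smoothed trajectory keeps roughly the same mean but loses much of its frame-to-frame variation, so its GV drops. A training criterion that accounts for the GV penalizes this kind of shrinkage.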


DOI: 10.21437/Interspeech.2016-1035

Cite as

Hosaka, N., Hashimoto, K., Oura, K., Nankaku, Y., Tokuda, K. (2016) Voice Conversion Based on Trajectory Model Training of Neural Networks Considering Global Variance. Proc. Interspeech 2016, 307-311.

BibTeX
@inproceedings{Hosaka+2016,
author={Naoki Hosaka and Kei Hashimoto and Keiichiro Oura and Yoshihiko Nankaku and Keiichi Tokuda},
title={Voice Conversion Based on Trajectory Model Training of Neural Networks Considering Global Variance},
year=2016,
booktitle={Interspeech 2016},
doi={10.21437/Interspeech.2016-1035},
url={http://dx.doi.org/10.21437/Interspeech.2016-1035},
pages={307--311}
}