Voice Conversion Based on Matrix Variate Gaussian Mixture Model Using Multiple Frame Features

Yi Yang, Hidetsugu Uchida, Daisuke Saito, Nobuaki Minematsu


This paper presents a novel voice conversion method based on a matrix variate Gaussian mixture model (MV-GMM) using features of multiple frames. In voice conversion studies, approaches based on Gaussian mixture models (GMM) are still widely used because of their flexibility and ease of handling. These approaches model the joint probability density function (PDF) of the source and target speakers' feature vectors by concatenating the two vectors into a single joint vector. Adding dynamic features to the feature vectors in GMM-based approaches yields certain performance improvements because the correlation between multiple frames is taken into account. Recently, a voice conversion framework based on MV-GMM, in which the joint PDF is modeled in a matrix variate space, has been proposed; it can precisely model both the characteristics of the feature spaces and the relation between the source and target speakers. In this paper, in order to additionally model the correlation between multiple frames in this framework more consistently, the MV-GMM is constructed in a matrix variate space containing the features of neighboring frames. Experimental results show a certain performance improvement in both objective and subjective evaluations.
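To make the conventional joint-vector modeling that the abstract contrasts against more concrete, the following is a minimal sketch in NumPy. It reduces the GMM to a single joint Gaussian for brevity (a full GMM applies the same conditional-mean conversion per mixture component, weighted by posterior probabilities), and it also shows the multi-frame context stacking that motivates the proposed method. All variable names and the synthetic data are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4            # dimensionality of one spectral feature vector (illustrative)
n = 2000         # number of time-aligned source/target frame pairs

# Time-aligned source (X) and target (Y) frames; in practice these come from
# parallel utterances aligned by e.g. dynamic time warping.
X = rng.normal(size=(n, d))
Y = X @ rng.normal(size=(d, d)) * 0.5 + rng.normal(size=(n, d)) * 0.1

# Joint vectors z_t = [x_t; y_t]: the joint PDF is modeled in this stacked space.
Z = np.hstack([X, Y])
mu = Z.mean(axis=0)
Sigma = np.cov(Z, rowvar=False)

mu_x, mu_y = mu[:d], mu[d:]
Sxx = Sigma[:d, :d]
Syx = Sigma[d:, :d]

def convert(x):
    """MMSE conversion for a joint Gaussian: E[y | x] = mu_y + Syx Sxx^-1 (x - mu_x)."""
    return mu_y + Syx @ np.linalg.solve(Sxx, x - mu_x)

# Multi-frame (context) features, as in the paper's motivation: concatenate each
# frame with its immediate neighbors so inter-frame correlation enters the model.
def with_context(F):
    return np.hstack([F[:-2], F[1:-1], F[2:]])

X_ctx = with_context(X)   # shape (n - 2, 3 * d)
```

Converting the source mean recovers the target mean exactly (`convert(mu_x)` equals `mu_y`), which is the sanity check that the conditional-Gaussian algebra is wired up correctly; a GMM generalizes this by mixing such per-component predictions.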


DOI: 10.21437/Interspeech.2016-705

Cite as

Yang, Y., Uchida, H., Saito, D., Minematsu, N. (2016) Voice Conversion Based on Matrix Variate Gaussian Mixture Model Using Multiple Frame Features. Proc. Interspeech 2016, 302-306.

BibTeX
@inproceedings{Yang+2016,
author={Yi Yang and Hidetsugu Uchida and Daisuke Saito and Nobuaki Minematsu},
title={Voice Conversion Based on Matrix Variate Gaussian Mixture Model Using Multiple Frame Features},
year=2016,
booktitle={Interspeech 2016},
doi={10.21437/Interspeech.2016-705},
url={http://dx.doi.org/10.21437/Interspeech.2016-705},
pages={302--306}
}