In traditional voice conversion, converted speech is generated using statistical parametric models (for example, Gaussian mixture models) whose parameters are estimated from parallel training utterances. A well-known problem of statistical parametric methods is that the statistical averaging in parameter estimation over-smooths the speech parameter trajectories, and thus leads to low conversion quality. Inspired by the recent success of so-called exemplar-based methods in robust speech recognition, we propose a voice conversion system based on non-negative spectrogram deconvolution that builds on similar ideas. Exemplars, which are able to capture temporal context, are employed to generate the converted speech spectrogram convolutively. The exemplar-based approach can be seen as a data-driven, nonparametric alternative to the traditional parametric approaches to voice conversion. Experiments on the VOICES database indicate that the proposed method outperforms the conventional joint density Gaussian mixture model by a wide margin in terms of both objective and subjective evaluations.
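The convolutive generation step can be illustrated with a minimal sketch. The abstract gives no implementation details, so everything below is an assumption made for illustration: paired source and target exemplar dictionaries `W_src` and `W_tgt` of shape `(T, F, K)` (T frames of context, F frequency bins, K exemplars), a Kullback-Leibler cost, and standard multiplicative updates for the activations with the dictionary held fixed. All function names are hypothetical. Activations estimated on the source spectrogram are reused with the target dictionary to produce the converted spectrogram.

```python
import numpy as np

def shift_right(X, t):
    """Shift columns of X right by t frames, zero-filling on the left."""
    if t == 0:
        return X.copy()
    out = np.zeros_like(X)
    out[:, t:] = X[:, :-t]
    return out

def shift_left(X, t):
    """Shift columns of X left by t frames, zero-filling on the right."""
    if t == 0:
        return X.copy()
    out = np.zeros_like(X)
    out[:, :-t] = X[:, t:]
    return out

def reconstruct(W, H):
    """Convolutive model: sum over t of W[t] @ (H shifted right by t)."""
    return sum(W[t] @ shift_right(H, t) for t in range(W.shape[0]))

def kl_divergence(V, L, eps=1e-12):
    """Generalized KL divergence between spectrogram V and model L."""
    return float(np.sum(V * np.log((V + eps) / (L + eps)) - V + L))

def estimate_activations(V, W, n_iter=100, eps=1e-12, seed=0):
    """Estimate non-negative activations H for a fixed exemplar dictionary W.

    Assumed cost: KL divergence, minimized with multiplicative updates.
    Requires the context length T to be smaller than the number of frames.
    """
    rng = np.random.default_rng(seed)
    T, _, K = W.shape
    H = rng.random((K, V.shape[1])) + 0.1
    ones = np.ones_like(V)
    # The denominator does not depend on H, so compute it once.
    denom = sum(W[t].T @ shift_left(ones, t) for t in range(T)) + eps
    for _ in range(n_iter):
        ratio = V / (reconstruct(W, H) + eps)
        numer = sum(W[t].T @ shift_left(ratio, t) for t in range(T))
        H *= numer / denom
    return H

def convert(V_src, W_src, W_tgt, n_iter=100):
    """Estimate activations on the source side, then apply the target dictionary."""
    H = estimate_activations(V_src, W_src, n_iter=n_iter)
    return reconstruct(W_tgt, H), H
```

In this sketch, each exemplar spans T consecutive frames, which is how temporal context enters the model; setting T = 1 reduces it to ordinary non-negative matrix factorization with a fixed dictionary.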
Index Terms: Voice conversion, exemplar, non-negative matrix factorization, non-negative matrix deconvolution, temporal information
Cite as: Wu, Z., Virtanen, T., Kinnunen, T., Chng, E.S., Li, H. (2013) Exemplar-based voice conversion using non-negative spectrogram deconvolution. Proc. 8th ISCA Workshop on Speech Synthesis (SSW 8), 201-206
@inproceedings{wu13_ssw, author={Zhizheng Wu and Tuomas Virtanen and Tomi Kinnunen and Eng Siong Chng and Haizhou Li}, title={{Exemplar-based voice conversion using non-negative spectrogram deconvolution}}, year=2013, booktitle={Proc. 8th ISCA Workshop on Speech Synthesis (SSW 8)}, pages={201--206} }