Eighth ISCA Workshop on Speech Synthesis

Barcelona, Catalonia, Spain
August 31-September 2, 2013

Exemplar-Based Voice Conversion using Non-Negative Spectrogram Deconvolution

Zhizheng Wu (1,2), Tuomas Virtanen (3), Tomi Kinnunen (4), Eng Siong Chng (1,2), Haizhou Li (1,2,5)

(1) School of Computer Engineering, Nanyang Technological University, Singapore
(2) Temasek Laboratories@NTU, Nanyang Technological University, Singapore
(3) Department of Signal Processing, Tampere University of Technology, Tampere, Finland
(4) School of Computing, University of Eastern Finland, Joensuu, Finland
(5) Human Language Technology Department, Institute for Infocomm Research, Singapore

In traditional voice conversion, converted speech is generated using statistical parametric models (for example, a Gaussian mixture model) whose parameters are estimated from parallel training utterances. A well-known problem of statistical parametric methods is that the statistical averaging in parameter estimation over-smooths the speech parameter trajectories and thus leads to low conversion quality. Inspired by the recent success of so-called exemplar-based methods in robust speech recognition, we propose a voice conversion system based on non-negative spectrogram deconvolution that builds on similar ideas. Exemplars, which are able to capture temporal context, are employed to generate the converted speech spectrogram convolutively. The exemplar-based approach can be seen as a data-driven, nonparametric alternative to the traditional parametric approaches to voice conversion. Experiments on the VOICES database indicate that the proposed method outperforms the conventional joint density Gaussian mixture model by a wide margin in terms of both objective and subjective evaluations.

Index Terms: Voice conversion, exemplar, non-negative matrix factorization, non-negative matrix deconvolution, temporal information
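
The idea of generating a spectrogram convolutively from multi-frame exemplars can be illustrated with a minimal sketch. The abstract does not give the paper's update rules or features, so everything below is an assumption: the sketch uses a generic convolutive non-negative model with Kullback-Leibler multiplicative updates, keeps the exemplar dictionaries fixed, and assumes that source and target exemplars are paired one to one (for example, cut from time-aligned parallel utterances). All function and variable names (`estimate_activations`, `convert`, `W_src`, `W_tgt`) are hypothetical.

```python
import numpy as np

def shift_right(M, t):
    """Shift columns right by t frames, zero-filling on the left."""
    if t == 0:
        return M.copy()
    out = np.zeros_like(M)
    out[:, t:] = M[:, :-t]
    return out

def shift_left(M, t):
    """Shift columns left by t frames, zero-filling on the right."""
    if t == 0:
        return M.copy()
    out = np.zeros_like(M)
    out[:, :-t] = M[:, t:]
    return out

def reconstruct(W, H):
    """Convolutive model: sum over t of W[t] @ shift_right(H, t).

    W: (T, F, N) dictionary of N exemplars, each T frames by F bins.
    H: (N, L) non-negative activations over L frames.
    """
    return sum(W[t] @ shift_right(H, t) for t in range(W.shape[0]))

def kl_divergence(V, R, eps=1e-12):
    """Generalized Kullback-Leibler divergence between V and R."""
    return float(np.sum(V * np.log((V + eps) / (R + eps)) - V + R))

def estimate_activations(V, W, n_iter=100, eps=1e-12):
    """Estimate activations for a fixed dictionary W (assumed update rule)."""
    T, F, N = W.shape
    H = np.ones((N, V.shape[1]))
    ones = np.ones_like(V)
    # The denominator does not depend on H, so compute it once.
    denom = sum(W[t].T @ shift_left(ones, t) for t in range(T)) + eps
    for _ in range(n_iter):
        ratio = V / (reconstruct(W, H) + eps)
        numer = sum(W[t].T @ shift_left(ratio, t) for t in range(T))
        H *= numer / denom  # multiplicative update keeps H non-negative
    return H

def convert(V_src, W_src, W_tgt, n_iter=100):
    """Decompose the source spectrogram, then rebuild with target exemplars."""
    H = estimate_activations(V_src, W_src, n_iter)
    return reconstruct(W_tgt, H), H
```

Calling `convert(V_src, W_src, W_tgt)` with a non-negative magnitude spectrogram `V_src` of shape `(F, L)` returns a converted spectrogram of the same shape together with the activations. Because each exemplar spans `T` frames, a single activation places a whole multi-frame target segment in the output, which is one way to read the abstract's point about exemplars capturing temporal context.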

Bibliographic reference.  Wu, Zhizheng / Virtanen, Tuomas / Kinnunen, Tomi / Chng, Eng Siong / Li, Haizhou (2013): "Exemplar-based voice conversion using non-negative spectrogram deconvolution", In SSW8, 201-206.