Recently, exemplar-based sparse representation methods have been proposed for voice conversion. These methods reconstruct a target spectrum as a weighted linear combination of a set of basis spectra, called exemplars. To impose a temporal constraint, multiple-frame exemplars are employed when estimating the linear combination weights, called activations, by nonnegative matrix factorization with a sparsity constraint. In practice, low-resolution mel-scale filter bank energies rather than high-resolution spectra are used to estimate the activations in order to reduce computational cost and memory usage. However, conversion performance degrades because spectral details are lost in the low-resolution representation. In this study, we propose a joint nonnegative matrix factorization technique that estimates the activations using both the low- and high-resolution features simultaneously. In this way, multiple-frame low-resolution exemplars provide temporal information at low computational cost, while one-frame high-resolution exemplars preserve spectral details. The VOICES database was employed to assess the performance of the proposed method. The experiments confirmed the effectiveness of the proposed method over the conventional nonnegative matrix factorization method in terms of both objective spectral distortion and subjective evaluation.
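The idea of estimating one set of activations from two feature resolutions can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not state the cost function, so the sketch assumes a generalized Kullback-Leibler divergence summed over both resolutions with an L1 sparsity penalty, fixed (not learned) dictionaries, and the standard multiplicative update. The names `joint_activations`, `convert`, `alpha`, and `sparsity` are hypothetical.

```python
import numpy as np


def kl_divergence(X, Y, eps=1e-12):
    """Generalized Kullback-Leibler divergence between nonnegative matrices."""
    return float(np.sum(X * np.log((X + eps) / (Y + eps)) - X + Y))


def joint_activations(X_low, X_high, A_low, A_high,
                      alpha=1.0, sparsity=0.1, n_iter=200, eps=1e-12):
    """Estimate activations H shared by two fixed source dictionaries.

    Assumed objective (an illustration, not taken from the paper):
        KL(X_low || A_low H) + alpha * KL(X_high || A_high H) + sparsity * sum(H)

    X_low  : (d_low,  T) stacked multiple-frame low-resolution source features
    X_high : (d_high, T) one-frame high-resolution source spectra
    A_low  : (d_low,  K) multiple-frame low-resolution source exemplars
    A_high : (d_high, K) one-frame high-resolution source exemplars
    """
    n_exemplars = A_low.shape[1]
    n_frames = X_low.shape[1]
    H = np.ones((n_exemplars, n_frames))

    # The denominator of the multiplicative update does not depend on H.
    denom = (A_low.T @ np.ones_like(X_low)
             + alpha * (A_high.T @ np.ones_like(X_high))
             + sparsity)

    for _ in range(n_iter):
        ratio_low = X_low / (A_low @ H + eps)
        ratio_high = X_high / (A_high @ H + eps)
        numer = A_low.T @ ratio_low + alpha * (A_high.T @ ratio_high)
        H *= numer / (denom + eps)  # keeps H nonnegative
    return H


def convert(H, B_high):
    """Reconstruct target spectra as a weighted combination of target exemplars.

    B_high : (d_high, K) target exemplars paired column-by-column with A_high.
    """
    return B_high @ H
```

In this sketch the two dictionaries share their column index, so exemplar k in `A_low`, `A_high`, and `B_high` is assumed to come from the same aligned source-target frame pair. Setting `alpha=0` reduces the update to activation estimation from the low-resolution features alone.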
Bibliographic reference. Wu, Zhizheng / Chng, Eng Siong / Li, Haizhou (2014): "Joint nonnegative matrix factorization for exemplar-based voice conversion", In INTERSPEECH-2014, 2509-2513.