11th Annual Conference of the International Speech Communication Association

Makuhari, Chiba, Japan
September 26-30, 2010

Text-Independent F0 Transformation with Non-Parallel Data for Voice Conversion

Zhi-Zheng Wu (1), Tomi Kinnunen (2), Eng Siong Chng (1), Haizhou Li (3)

(1) Nanyang Technological University, Singapore
(2) University of Eastern Finland, Finland
(3) A*STAR, Singapore

In voice conversion, fundamental frequency (F0) is typically transformed by frame-level mean and variance normalization, which is text-independent and requires no parallel training data. More advanced methods transform pitch contours instead, but they require either parallel training data or syllabic annotations. We propose a method that retains the simplicity and text-independence of frame-level conversion while yielding high-quality conversion. We achieve these goals by (1) introducing a text-independent tri-frame alignment method, (2) including delta features of F0 in Gaussian mixture model (GMM) conversion, and (3) reducing the well-known GMM oversmoothing effect by F0 histogram equalization. Our objective and subjective experiments on the CMU Arctic corpus indicate improvements over both mean/variance normalization and the baseline GMM conversion.
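
For orientation, the frame-level mean/variance normalization that the abstract names as the usual baseline can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes the normalization is applied in the log-F0 domain (a common choice that the abstract does not specify), that unvoiced frames are marked with F0 = 0, and the function names and toy contours are hypothetical.

```python
import math
from statistics import mean, pstdev


def log_f0_stats(f0_frames):
    """Mean and standard deviation of log-F0 over voiced frames (F0 > 0)."""
    logs = [math.log(f) for f in f0_frames if f > 0]
    return mean(logs), pstdev(logs)


def convert_f0(src_f0, src_stats, tgt_stats):
    """Frame-level mean/variance normalization in the log-F0 domain.

    Assumption: unvoiced frames (F0 <= 0) are emitted as 0.0, unchanged.
    """
    mu_s, sd_s = src_stats
    mu_t, sd_t = tgt_stats
    out = []
    for f in src_f0:
        if f <= 0:
            out.append(0.0)
        else:
            # Standardize against the source speaker, rescale to the target.
            z = (math.log(f) - mu_s) / sd_s
            out.append(math.exp(mu_t + sd_t * z))
    return out


# Hypothetical toy contours in Hz; 0 marks an unvoiced frame.
# The two contours need not be parallel or equally long.
source = [110.0, 120.0, 0.0, 130.0, 115.0]
target = [210.0, 230.0, 250.0, 0.0, 220.0]
converted = convert_f0(source, log_f0_stats(source), log_f0_stats(target))
```

Because only global statistics of each speaker are needed, the source and target recordings do not have to contain the same sentences, which is why this baseline is text-independent and works with non-parallel data.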

Bibliographic reference. Wu, Zhi-Zheng / Kinnunen, Tomi / Chng, Eng Siong / Li, Haizhou (2010): "Text-independent F0 transformation with non-parallel data for voice conversion", In INTERSPEECH-2010, 1732-1735.