In this paper, we propose a novel voice conversion method called speaker model alignment (SMA), which does not require parallel training speech. First, the source and target speaker models, each described by a Gaussian mixture model (GMM), are trained separately. Then, the transformation function for spectral features is learned by iteratively aligning the components of the source and target speaker models. The transformation function is further combined with the GMM to enable multiple local mappings, and a local consistent GMM (LCGMM) is also adopted in model training to improve conversion accuracy. Finally, we carry out experiments to evaluate the performance of the proposed method. Objective and subjective results demonstrate that, compared with the well-known INCA approach, the proposed method achieves lower spectral distortion, higher correlation, and a significant improvement in perceptual quality and similarity.
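The core idea of aligning the components of independently trained speaker models can be illustrated with a simplified sketch. The snippet below is not the authors' algorithm: it replaces the GMM-based multiple local mappings with a single global linear map, and uses synthetic component means (`src_means`, `tgt_means`, `true_A` are all illustrative stand-ins for means that would come from GMMs trained on each speaker's spectral features). It shows only the alternation between nearest-neighbour component pairing and a least-squares refit of the transformation.

```python
import numpy as np

rng = np.random.default_rng(0)
D, K = 4, 8  # feature dimension, number of mixture components

# Hypothetical component means; in the actual method these would be the
# means of GMMs trained separately on source and target speaker spectra.
true_A = rng.normal(size=(D, D))
src_means = rng.normal(size=(K, D))
tgt_means = src_means @ true_A.T + 0.01 * rng.normal(size=(K, D))
tgt_means = tgt_means[rng.permutation(K)]  # component order is unknown


def iterative_alignment(src, tgt, n_iter=20):
    """Alternate (1) pairing each transformed source component with its
    nearest target component and (2) refitting a linear map y ~ A x
    by least squares on the current pairing."""
    A = np.eye(src.shape[1])
    for _ in range(n_iter):
        mapped = src @ A.T
        # squared distances between mapped source and target components
        d = ((mapped[:, None, :] - tgt[None, :, :]) ** 2).sum(axis=-1)
        pair = d.argmin(axis=1)  # nearest target index per source component
        # refit the transform on the paired means
        X, *_ = np.linalg.lstsq(src, tgt[pair], rcond=None)
        A = X.T
    return A, pair


A_est, pair = iterative_alignment(src_means, tgt_means)
residual = np.abs(src_means @ A_est.T - tgt_means[pair]).max()
```

With random initialisation the greedy pairing may be many-to-one, which is one reason methods in this family (including INCA) iterate the alignment rather than pairing once; the paper's GMM combination and LCGMM refinements address the locality that a single global map cannot capture.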
Bibliographic reference. Song, Peng / Jin, Yun / Zheng, Wenming / Zhao, Li (2014): "Text-independent voice conversion using speaker model alignment method from non-parallel speech", In INTERSPEECH-2014, 2308-2312.