Although the temporal information of speech has been shown to play an important role in perception, most voice conversion approaches assume that speech frames are independent of each other, thereby ignoring this temporal information. In this study, we improve the conventional unit selection approach by using exemplars that span multiple frames as the base units, and we incorporate a temporal constraint into voice conversion by using overlapping frames to generate the speech parameters. This approach yields a more stable concatenation cost and avoids the discontinuity problem of conventional unit selection. The proposed method also avoids the over-smoothing problem of the mainstream joint density Gaussian mixture model (JD-GMM) conversion method by directly using the target speaker's training data to synthesize the converted speech. Both objective and subjective evaluations indicate that the proposed method outperforms the JD-GMM and conventional unit selection methods.
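To make the idea concrete, the following is a minimal sketch of exemplar-based selection with overlapping-frame generation. It is an illustrative assumption, not the paper's actual algorithm: the cost functions (Euclidean target and concatenation costs), the greedy left-to-right search (the paper may use dynamic programming over a candidate lattice), and all function names are hypothetical. Each exemplar is a window of `win` consecutive target-speaker frames; the concatenation cost compares the overlapping frames of adjacent exemplars, and the final parameters are obtained by averaging the overlaps.

```python
import numpy as np

def select_exemplars(source, target_exemplars, win=3, w_concat=1.0):
    """Greedy exemplar selection (illustrative): for each source window,
    pick the target-speaker exemplar minimizing a target cost plus a
    concatenation cost measured on the (win - 1)-frame overlap with the
    previously selected exemplar."""
    T = len(source)
    chosen, prev = [], None
    for t in range(T - win + 1):
        src_win = source[t:t + win]
        best, best_cost = None, np.inf
        for ex in target_exemplars:            # each ex has shape (win, dim)
            cost = np.linalg.norm(ex - src_win)          # target cost
            if prev is not None:
                # concatenation cost over the overlapping frames
                cost += w_concat * np.linalg.norm(ex[:-1] - prev[1:])
            if cost < best_cost:
                best, best_cost = ex, cost
        chosen.append(best)
        prev = best
    return chosen

def overlap_average(chosen, win=3):
    """Generate output parameters by averaging the overlapping frames of
    consecutive exemplars (one-frame shift per exemplar)."""
    T = len(chosen) + win - 1
    dim = chosen[0].shape[1]
    acc = np.zeros((T, dim))
    cnt = np.zeros((T, 1))
    for i, ex in enumerate(chosen):
        acc[i:i + win] += ex
        cnt[i:i + win] += 1
    return acc / cnt                           # averaged (smoothed) trajectory
```

Because every output frame is an average of frames drawn directly from the target speaker's training data, the sketch reflects the paper's two claims: overlaps smooth the joins (mitigating discontinuity), and no statistical averaging across a parametric model is involved (mitigating JD-GMM over-smoothing).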
Bibliographic reference. Wu, Zhizheng / Virtanen, Tuomas / Kinnunen, Tomi / Chng, Eng Siong / Li, Haizhou (2013): "Exemplar-based unit selection for voice conversion utilizing temporal information", in Proc. INTERSPEECH 2013, 3057-3061.