The Seventh ISCA Tutorial and Research Workshop on Speech Synthesis
Voice conversion methods are widely used to convert speech uttered by a source speaker so that it sounds like a target speaker. Frame-based voice conversion, however, generally suffers from incorrect alignment when only spectral distance is used, and therefore generates improper conversion results. In a parallel phone sequence, alignment by minimum spectral distance between the frame-based feature vectors of the source and target phone sequences is theoretically impractical, since the spectral properties of the source and target phones are inherently different. Nevertheless, if the feature vectors of the phone sequence are transformed into codewords in an eigen space, the eigen-codeword occurrence distribution curves of the source and target phone sequences are likely to be similar. By integrating the codeword occurrence distribution into distance estimation, a more precise frame alignment based on dynamic time warping can be obtained. With this more precise alignment, voice conversion functions can be properly constructed. Objective and subjective evaluations were conducted, and comparison with spectral-distance-based alignment confirms the improved performance of the proposed method.
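The idea of adding a codeword-occurrence term to the dynamic time warping distance can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how the two terms are combined, so the weighted sum, the `weight` parameter, the function name `dtw_align`, and the representation of the occurrence distribution as one scalar per frame (`src_occ`, `tgt_occ`) are all assumptions made for illustration.

```python
import math


def dtw_align(src, tgt, src_occ, tgt_occ, weight=1.0):
    """Align two frame sequences with dynamic time warping.

    src, tgt         : lists of spectral feature vectors (one per frame)
    src_occ, tgt_occ : hypothetical per-frame codeword occurrence values
    weight           : assumed trade-off between the two distance terms
    Returns (total_cost, path), where path is a list of (i, j) frame pairs.
    """
    n, m = len(src), len(tgt)

    def local_cost(i, j):
        # Assumed combination: spectral distance plus a weighted
        # difference of the occurrence values of the two frames.
        spectral = math.dist(src[i], tgt[j])
        occurrence = abs(src_occ[i] - tgt_occ[j])
        return spectral + weight * occurrence

    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_prev = min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
            cost[i][j] = local_cost(i - 1, j - 1) + best_prev

    # Backtrack from the last frame pair to the first.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return cost[n][m], path
```

Setting `weight=0.0` reduces the sketch to alignment by spectral distance alone, which is the baseline the abstract compares against.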
Index Terms: Voice conversion, eigen vector, phone alignment
Bibliographic reference. Huang, Yi-Chin / Wu, Chung-Hsien / Lee, Chung-Han / Chao, Yu-Ting (2010): "Voice conversion using precise speech alignment based on spectral property and eigen-codeword distribution", In SSW7-2010, 62-67.