This paper proposes a segment-based voice conversion technique between arbitrary speakers that requires only a small amount of training data. In the proposed technique, an input speech utterance of the source speaker is decoded into phonetic and prosodic symbol sequences, and the converted speech is then generated from the pre-trained target speaker's HMM using the decoded information. To reduce the required amount of training data, we use a speaker-independent model for decoding the input speech and model adaptation for training the target speaker's model. Experimental results show that no training data from the source speaker is needed, and that the proposed technique, using only ten sentences of the target speaker's adaptation data, outperforms the conventional GMM-based technique trained on 200 parallel sentences.
Bibliographic reference. Nose, Takashi / Kobayashi, Takao (2010): "Speaker-independent HMM-based voice conversion using quantized fundamental frequency", in Proc. INTERSPEECH 2010, pp. 1724-1727.
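The decode-then-synthesize idea in the abstract can be illustrated with a toy sketch: a speaker-independent discrete HMM decodes quantized input features into a symbol sequence via Viterbi, and "conversion" emits the target speaker's mean feature for each decoded symbol. All models, probabilities, and values below are illustrative placeholders, not the authors' actual models or parameters.

```python
# Toy sketch of segment-based voice conversion (illustrative only):
# 1) decode input features with a speaker-independent HMM (Viterbi),
# 2) regenerate features from a hypothetical target-speaker model.
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most likely state sequence for obs (log-domain Viterbi)."""
    V = [{s: (math.log(start_p[s]) + math.log(emit_p[s][obs[0]]), [s])
          for s in states}]
    for t in range(1, len(obs)):
        V.append({})
        for s in states:
            best_lp, best_path = max(
                (V[t - 1][prev][0] + math.log(trans_p[prev][s])
                 + math.log(emit_p[s][obs[t]]), V[t - 1][prev][1])
                for prev in states)
            V[t][s] = (best_lp, best_path + [s])
    return max(V[-1].values())[1]

# Hypothetical speaker-independent model over two phone-like symbols,
# observing quantized ("lo"/"hi") acoustic features.
states = ["a", "i"]
start_p = {"a": 0.6, "i": 0.4}
trans_p = {"a": {"a": 0.7, "i": 0.3}, "i": {"a": 0.3, "i": 0.7}}
emit_p = {"a": {"lo": 0.8, "hi": 0.2}, "i": {"lo": 0.2, "hi": 0.8}}

# Hypothetical target-speaker model: one mean feature value per symbol
# (standing in for the adapted target-speaker HMM output distributions).
target_means = {"a": 1.1, "i": 2.5}

def convert(quantized_features):
    """Decode with the SI model, then emit target-speaker means."""
    symbols = viterbi(quantized_features, states, start_p, trans_p, emit_p)
    return [target_means[s] for s in symbols]
```

In this sketch, only the small `target_means` table depends on the target speaker, mirroring the paper's point that the source speaker needs no training data and the target model can be obtained by adaptation from limited data.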