The Seventh ISCA Tutorial and Research Workshop on Speech Synthesis
This paper proposes an HMM-based voice conversion (VC) technique with quantized F0 symbol context using adaptive F0 quantization. In HMM-based VC, an input utterance of a source speaker is decoded into phonetic and prosodic symbol sequences, and the converted speech is generated from the decoded information using the target speaker's pre-trained phonetically and prosodically context-dependent HMM. In our previous work, we generated the F0 symbol by quantizing the average log F0 value of each phone using global mean and variance parameters calculated from the training data. In this study, these statistical parameters are obtained sentence by sentence, and this adaptive approach enables more robust F0 conversion than our conventional technique. Objective and subjective experimental results for English and Japanese speech show that the proposed adaptive quantization technique gives better F0 conversion performance than the conventional one. Moreover, HMM-based VC is significantly more robust to variation in the source speaker's individuality than GMM-based VC.
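The difference between global and sentence-by-sentence quantization can be illustrated with a minimal Python sketch. The abstract does not specify the number of quantization levels or where the bin boundaries lie, so the three levels and the one-standard-deviation bin width used here are assumptions, and the function names `quantize`, `quantize_global`, and `quantize_adaptive` are hypothetical, not the authors' implementation.

```python
import math
from statistics import mean, pstdev


def quantize(phone_log_f0, mu, sigma, num_levels=3):
    """Map each phone's average log F0 to a symbol in 0..num_levels-1.

    Assumption: bins are one standard deviation wide and centred on mu.
    Values outside the outermost bins are clamped to the end symbols.
    """
    if sigma == 0:
        # Flat contour: every phone falls in the middle bin.
        return [num_levels // 2] * len(phone_log_f0)
    symbols = []
    for x in phone_log_f0:
        z = (x - mu) / sigma                    # normalized log F0
        level = math.floor(z + num_levels / 2)  # bin index before clamping
        symbols.append(min(max(level, 0), num_levels - 1))
    return symbols


def quantize_global(phone_log_f0, train_mu, train_sigma, num_levels=3):
    """Conventional scheme: statistics come from the training data."""
    return quantize(phone_log_f0, train_mu, train_sigma, num_levels)


def quantize_adaptive(phone_log_f0, num_levels=3):
    """Adaptive scheme: statistics are recomputed for each sentence."""
    mu = mean(phone_log_f0)
    sigma = pstdev(phone_log_f0)
    return quantize(phone_log_f0, mu, sigma, num_levels)


# Hypothetical per-phone average log F0 values for one sentence,
# and the same sentence from a speaker whose pitch is uniformly higher.
sentence = [4.8, 5.0, 5.2]
shifted = [x + 0.5 for x in sentence]

adaptive_symbols = quantize_adaptive(shifted)
global_symbols = quantize_global(shifted, train_mu=5.0, train_sigma=0.2)
```

In this sketch, adding a constant offset to every phone leaves the adaptive symbols unchanged, whereas the globally normalized symbols saturate at the highest level. This is one plausible reading of why per-sentence statistics make the F0 symbols less sensitive to the source speaker's pitch range.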
Bibliographic reference. Nose, Takashi / Kobayashi, Takao (2010): "HMM-based robust voice conversion using adaptive F0 quantization", In SSW7-2010, 80-85.