In statistical voice conversion, distance measures between the converted and target spectral parameters are often used as evaluation/training metrics. However, even if the same speaker utters the same sentence several times, the spectral parameters of those utterances vary, and therefore a spectral distance between them still exists. Moreover, in real-time conversion, the converted speech often keeps the original prosodic features of the input speech, because converting prosodic features with a complex method is essentially difficult. In such a case, an ideal sample of converted speech would be an utterance by the target speaker imitating the prosody of the input speech. However, the spectral variation caused by such a prosodic change is not considered in current evaluation/training metrics. In this study, we investigate intra-speaker spectral variation between utterances of the same sentence, focusing on mel-cepstral coefficients as the spectral parameter. Moreover, we propose a method for predicting this variation from prosodic parameter differences between those utterances, and conduct experimental evaluations to show its effectiveness.
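As a rough illustration of the kind of spectral distance discussed above, the following sketch computes a frame-averaged mel-cepstral distortion between two mel-cepstral sequences. The abstract does not state which distance the authors use, so the choice of mel-cepstral distortion is an assumption, as are the function name `mel_cepstral_distortion`, the requirement that both sequences are already time-aligned (for example by dynamic time warping), and the exclusion of the 0th (power) coefficient.

```python
import numpy as np


def mel_cepstral_distortion(mcep_a, mcep_b):
    """Frame-averaged mel-cepstral distortion in dB (hypothetical helper).

    Assumptions (not taken from the paper):
    - mcep_a and mcep_b have shape (frames, dims) and are already time-aligned.
    - The 0th (power) coefficient has been removed by the caller.
    """
    a = np.asarray(mcep_a, dtype=float)
    b = np.asarray(mcep_b, dtype=float)
    if a.shape != b.shape or a.ndim != 2:
        raise ValueError("inputs must be 2-D arrays of identical shape")

    diff = a - b
    # Per-frame distortion: (10 / ln 10) * sqrt(2 * sum_d (a_d - b_d)^2)
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(np.mean(per_frame))


# Toy data standing in for two utterances of the same sentence by one speaker.
rng = np.random.default_rng(0)
utterance_1 = rng.normal(size=(100, 24))
utterance_2 = utterance_1 + 0.05 * rng.normal(size=(100, 24))
intra_speaker_mcd = mel_cepstral_distortion(utterance_1, utterance_2)
```

Under these assumptions, a nonzero `intra_speaker_mcd` for two renditions of the same sentence by the same speaker corresponds to the intra-speaker variation that the abstract argues current metrics ignore.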
Index Terms: voice conversion, training/evaluation criterion, intra-speaker spectral variation, prosodic differences, prediction
Cite as: Inukai, T., Toda, T., Neubig, G., Sakti, S., Nakamura, S. (2013) Investigation of intra-speaker spectral parameter variation and its prediction towards improvement of spectral conversion metric. Proc. 8th ISCA Workshop on Speech Synthesis (SSW 8), 89-94
@inproceedings{inukai13_ssw,
  author={Tatsuo Inukai and Tomoki Toda and Graham Neubig and Sakriani Sakti and Satoshi Nakamura},
  title={{Investigation of intra-speaker spectral parameter variation and its prediction towards improvement of spectral conversion metric}},
  year=2013,
  booktitle={Proc. 8th ISCA Workshop on Speech Synthesis (SSW 8)},
  pages={89--94}
}