5th International Conference on Spoken Language Processing
Voice adaptation is the process of converting the output of a text-to-speech synthesizer so that it sounds like a different voice, after training on only a small amount of the desired target speaker's speech. We employ a locally linear conversion function based on Gaussian mixture models to map Bark-scaled line spectral frequencies. We compare the performance of three estimation methods while varying the number of mixture components and the amount of training data. An objective evaluation revealed that all three methods yield similar test results. In perceptual tests, listeners judged the converted speech quality as acceptable and the adaptation to the target speaker as fairly successful.
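To make "locally linear conversion function based on Gaussian mixture models" concrete, the following is a minimal sketch of one common form of such a mapping: each mixture component owns a linear transform, and the transforms are blended by the component posterior probabilities of the input vector. The abstract does not give the exact parameterization or the three estimation methods, so the function names (`posteriors`, `convert`), the per-component matrices `A` and offsets `b`, and the use of full covariance matrices are all assumptions made for illustration.

```python
import numpy as np

def gaussian_log_pdf(x, mean, cov):
    """Log density of a multivariate Gaussian at a single vector x."""
    d = x.shape[0]
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = diff @ np.linalg.solve(cov, diff)
    return -0.5 * (d * np.log(2.0 * np.pi) + logdet + maha)

def posteriors(x, weights, means, covs):
    """Posterior probability of each mixture component given x."""
    log_p = np.array([
        np.log(w) + gaussian_log_pdf(x, m, c)
        for w, m, c in zip(weights, means, covs)
    ])
    log_p -= log_p.max()          # subtract the max for numerical stability
    p = np.exp(log_p)
    return p / p.sum()

def convert(x, weights, means, covs, A, b):
    """Hypothetical locally linear conversion: F(x) = sum_i h_i(x) (A_i x + b_i).

    x       : source feature vector (e.g. Bark-scaled LSFs), shape (d,)
    A, b    : assumed per-component linear transforms and offsets
    """
    h = posteriors(x, weights, means, covs)
    local = np.array([A_i @ x + b_i for A_i, b_i in zip(A, b)])
    return h @ local              # posterior-weighted blend of local maps

# Illustrative parameters only: two well-separated components in 2 dimensions.
weights = np.array([0.5, 0.5])
means = [np.zeros(2), np.full(2, 10.0)]
covs = [np.eye(2), np.eye(2)]
A = [np.eye(2), np.zeros((2, 2))]
b = [np.zeros(2), np.full(2, 5.0)]

y = convert(np.array([0.1, -0.2]), weights, means, covs, A, b)
```

Because the input in this example lies close to the first component's mean, that component's transform dominates the result; inputs between components would receive a smooth mixture of both transforms, which is what makes the overall mapping locally rather than globally linear.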
Bibliographic reference. Kain, Alexander / Macon, Michael W. (1998): "Text-to-speech voice adaptation from sparse training data", In ICSLP-1998, paper 0902.