5th International Conference on Spoken Language Processing

Sydney, Australia
November 30 - December 4, 1998

Text-to-Speech Voice Adaptation from Sparse Training Data

Alexander Kain, Michael W. Macon

Oregon Graduate Institute of Science and Technology, USA

Voice adaptation is the process of converting the output of a text-to-speech synthesizer so that it sounds like a different target speaker, after training on only a small amount of that speaker's speech. We employ a locally linear conversion function based on Gaussian mixture models to map Bark-scaled line spectral frequencies. We compare three estimation methods while varying the number of mixture components and the amount of training data. An objective evaluation showed that all three methods yield similar test results. In perceptual tests, listeners judged the converted speech to be of acceptable quality and fairly successful in adapting to the target speaker.
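
A locally linear conversion function of the kind named in the abstract can be sketched as a posterior-weighted sum of per-component linear maps, F(x) = sum_i p_i(x) (A_i x + b_i), where p_i(x) is the posterior probability of mixture component i given x. The sketch below is illustrative only and is not the authors' implementation. The function names, the parameter layout (`A`, `b`), and the assumption that input vectors are already Bark-scaled line spectral frequencies are all hypothetical, and the three estimation methods compared in the paper are not reproduced here.

```python
import numpy as np


def gmm_posteriors(x, weights, means, covs):
    """Posterior probability of each Gaussian mixture component given x.

    x: (D,) feature vector (assumed here to be Bark-scaled LSFs).
    weights: (K,), means: (K, D), covs: (K, D, D) full covariances.
    """
    dim = len(x)
    log_p = np.empty(len(weights))
    for i, (w, mu, cov) in enumerate(zip(weights, means, covs)):
        diff = x - mu
        _, logdet = np.linalg.slogdet(cov)
        maha = diff @ np.linalg.solve(cov, diff)  # squared Mahalanobis distance
        log_p[i] = np.log(w) - 0.5 * (logdet + maha + dim * np.log(2 * np.pi))
    log_p -= log_p.max()  # stabilise before exponentiating
    p = np.exp(log_p)
    return p / p.sum()


def convert(x, weights, means, covs, A, b):
    """Locally linear conversion: posterior-weighted sum of linear maps.

    A: (K, D, D) and b: (K, D) are hypothetical per-component parameters;
    how they are estimated from training data is outside this sketch.
    """
    post = gmm_posteriors(x, weights, means, covs)
    local = np.einsum("kij,j->ki", A, x) + b  # each component's A_i x + b_i
    return post @ local                       # blend by posterior weight
```

Because the posteriors vary smoothly with x, the mapping behaves like a single linear transform near each mixture mean and interpolates between transforms elsewhere, which is what "locally linear" refers to in this sketch.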

Bibliographic reference. Kain, Alexander / Macon, Michael W. (1998): "Text-to-speech voice adaptation from sparse training data", in ICSLP-1998, paper 0902.