EUROSPEECH 2001 Scandinavia
This paper describes a technique for synthesizing speech with any desired voice. The technique is based on an HMM-based text-to-speech (TTS) system and MLLR adaptation algorithm. To generate speech of an arbitrarily given target speaker, speaker-independent speech units, i.e., average voice models, is adapted to the target speaker using MLLR framework. In addition to spectrum and pitch adaptation, we derive an algorithm for adaptation of state duration. We demonstrate that a few sentences uttered by a target speaker are sufficient to adapt not only voice characteristics but also prosodic features. Synthetic speech generated from adapted models using only four sentences is very close to that from speaker dependent models trained using a large amount of speech data.
Bibliographic reference. Tamura, Masatsune / Masuko, Takashi / Tokuda, Keiichi / Kobayashi, Takao (2001): "Text-to-speech synthesis with arbitrary speaker's voice from average voice", In EUROSPEECH-2001, 345-348.