8th Annual Conference of the International Speech Communication Association

Antwerp, Belgium
August 27-31, 2007

Preliminary Experiments Toward Automatic Generation of New TTS Voices from Recorded Speech Alone

Ryuki Tachibana (1), Tohru Nagano (1), Gakuto Kurata (1), Masafumi Nishimura (1), Noboru Babaguchi (2)

(1) IBM Japan, Japan
(2) Osaka University, Japan

Generating a new concatenative text-to-speech (TTS) voice from recordings of a human voice requires not only the recordings but also additional information such as transcriptions, prosodic labels, and phonemic alignments. Because some of this information depends on the narrator's speaking style, it must be added manually by listening to the recordings, which is costly and time-consuming. To tackle this problem, we have been working on a totally trainable TTS system in which every component, including the text processing module, can be trained automatically from a speech corpus. In this paper, we refine the framework and propose several submodules that collect all of the linguistic and acoustic information necessary for generating a TTS voice from the recorded speech. Although completely automatic generation of a new voice is not yet possible, we report progress on the submodules and present experimental results.

Bibliographic reference. Tachibana, Ryuki / Nagano, Tohru / Kurata, Gakuto / Nishimura, Masafumi / Babaguchi, Noboru (2007): "Preliminary experiments toward automatic generation of new TTS voices from recorded speech alone", in INTERSPEECH-2007, 1917-1920.