Generating a new concatenative text-to-speech (TTS) voice from recordings of a human voice requires not only the recordings themselves but also additional information such as transcriptions, prosodic labels, and phonemic alignments. Since some of this information depends on the speaking style of the narrator, it must be added manually by listening to the recordings, which is costly and time-consuming. To tackle this problem, we have been working on a fully trainable TTS system, every component of which, including the text processing module, can be trained automatically from a speech corpus. In this paper, we refine the framework and propose several submodules that collect all of the linguistic and acoustic information necessary for generating a TTS voice from the recorded speech. Although completely automatic generation of a new voice is not yet possible, we report progress on the submodules through experimental results.
Bibliographic reference. Tachibana, Ryuki / Nagano, Tohru / Kurata, Gakuto / Nishimura, Masafumi / Babaguchi, Noboru (2007): "Preliminary experiments toward automatic generation of new TTS voices from recorded speech alone", In INTERSPEECH-2007, 1917-1920.