Sixth International Conference on Spoken Language Processing
(ICSLP 2000)

Beijing, China
October 16-20, 2000

Perceptually Based Automatic Prosody Labeling and Prosodically Enriched Unit Selection Improve Concatenative Text-to-Speech Synthesis

Colin W. Wightman (1), Ann K. Syrdal, Georg Stemmer (2), Alistair Conkie, Mark Beutnagel

AT&T Labs - Research, Florham Park, NJ, USA
(1) also Dept. of Computer and Information Sciences, Minnesota State University, Mankato, MN, USA
(2) also Informatik Dept., University of Erlangen, Germany

Prosody is an important factor in the quality of text-to-speech (TTS) synthesis. Typically, acoustic parameters such as f0 and duration are the only prosody-related variables used in unit selection. Our study explored the explicit use of linguistically and perceptually motivated prosodic categories in unit selection-based TTS. One of our goals was to automate the prosodic labeling of our TTS inventory. However, inter-labeler reliability for some ToBI (Tones and Break Indices) categories was too low for successful training of an automatic prosody recognizer. We therefore developed a prosody labeling system that is simpler and more robust than standard EToBI (English ToBI). This "ToBI Lite" system was used successfully for automatic labeling of the acoustic inventory and in prosodically enriched unit selection. A formal listening test compared subjective quality ratings for several variations of the AT&T unit selection concatenative TTS system that differed only in their method of prosodic labeling of the inventory or their use of prosody for unit selection. The use of simple prosodic categories in unit selection significantly improved ratings, and automatic prosodic labeling resulted in higher ratings than manual labeling.

Bibliographic reference. Wightman, Colin W. / Syrdal, Ann K. / Stemmer, Georg / Conkie, Alistair / Beutnagel, Mark (2000): "Perceptually based automatic prosody labeling and prosodically enriched unit selection improve concatenative text-to-speech synthesis", in ICSLP-2000, vol. 2, 71-74.