Interspeech'2005 - Eurospeech

Lisbon, Portugal
September 4-8, 2005

A Comparison of Methods for Speaker-Dependent Pronunciation Tuning for Text-to-Speech Synthesis

Gabriel Webster, Tina Burrows, Katherine Knill

Toshiba Research Europe Ltd., UK

Unit-based text-to-speech (TTS) systems typically use a set of speech recordings that have been phonetically transcribed to create a large set of phonetic units. During synthesis, pronunciations for input text are generated and used to guide the selection of a sequence of phonetic units. The style of these system pronunciations must match the style of the phonetic transcriptions of the recorded speech database in order to maximize the quality of the synthesized speech. Furthermore, since different speakers have different speech characteristics, supporting multiple speakers for a single language generally requires applying a speaker-dependent mapping to speaker-independent pronunciations. This paper investigates three automatic methods for this process of speaker-dependent pronunciation tuning: word N-grams, decision trees, and transformation-based learning. Transformation-based learning achieved the best results: measured against the speech transcriptions, it lowered the phone error rate of the text pronunciations by 26% compared with the error rate of the unmodified text pronunciations.
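
The abstract does not say how the phone error rate is scored. As an illustration only, the sketch below assumes the common definition: the minimum number of phone substitutions, deletions, and insertions needed to turn the text pronunciation into the speech transcription, divided by the length of the speech transcription. The function name `phone_error_rate` and the phone symbols in the usage note are hypothetical and are not taken from the paper.

```python
def phone_error_rate(reference, hypothesis):
    """Edit distance between two phone sequences, divided by the reference length.

    reference:  phones from the transcription of the recorded speech
    hypothesis: phones from the pronunciation generated for the text
    Assumption: all three edit types (substitution, deletion, insertion) cost 1.
    """
    if not reference:
        raise ValueError("reference must contain at least one phone")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j phones of the hypothesis.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_phone in enumerate(reference, start=1):
        curr = [i]  # matching i reference phones against an empty hypothesis
        for j, hyp_phone in enumerate(hypothesis, start=1):
            substitution = prev[j - 1] + (0 if ref_phone == hyp_phone else 1)
            deletion = prev[j] + 1      # reference phone missing from hypothesis
            insertion = curr[j - 1] + 1  # extra phone in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(reference)
```

For example, with a made-up speech transcription `["b", "ah", "t", "er"]` and a made-up text pronunciation `["b", "ah", "t", "ax", "r"]`, the function counts one substitution and one insertion against four reference phones. A tuning method would be judged by how far it lowers this figure over a whole test set.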

Bibliographic reference.  Webster, Gabriel / Burrows, Tina / Knill, Katherine (2005): "A comparison of methods for speaker-dependent pronunciation tuning for text-to-speech synthesis", In INTERSPEECH-2005, 2809-2812.