8th International Conference on Spoken Language Processing

Jeju Island, Korea
October 4-8, 2004

A Database Design for a TTS Synthesis System Using Lexical Diphones

Tanya Lambert (1), Andrew Breen (2)

(1) University of East Anglia, Norwich, UK
(2) Nuance Communications Inc, UK

Database designs based on the premise that English has about 2000 diphones, as stated in many publications and online documents, are likely to yield a diphone database that fails to capture some important phonological phenomena of English. This paper proposes a TTS database built from diphones that include their syllabic stress; we term these units lexical diphones. A comprehensive lexical diphone feature set is generated using a stress-annotated dictionary together with continuous text and speech. A method based on multiple set cover algorithms, applied to wordlists of specialized English usage, is combined with a knowledge-based phonological approach to produce a core text corpus of 540 sentences. An objective comparison of our database with other databases shows that ours has a higher concentration of lexical diphones for its size; a subjective evaluation shows that listeners prefer speech synthesized with more lexical than phonemic units.
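
The abstract does not spell out how lexical diphones are encoded or which set cover variants are used, so the following is only a minimal sketch under stated assumptions. It assumes each phone is tagged with the stress of its syllable (0 or 1, a hypothetical encoding), forms lexical diphones as adjacent pairs of such tagged phones, and selects items with a plain greedy set cover. The names `lexical_diphones` and `greedy_cover` are illustrative and do not come from the paper, and the greedy heuristic stands in for the authors' method rather than reproducing it.

```python
from typing import Dict, List, Set, Tuple

# Assumed encoding: (phone symbol, stress of the containing syllable).
Phone = Tuple[str, int]
LexicalDiphone = Tuple[Phone, Phone]


def lexical_diphones(phones: List[Phone]) -> Set[LexicalDiphone]:
    """Return the set of adjacent stress-tagged phone pairs in a sequence."""
    return set(zip(phones, phones[1:]))


def greedy_cover(candidates: Dict[str, Set[LexicalDiphone]]) -> List[str]:
    """Greedy set cover: repeatedly pick the candidate that adds the most
    not-yet-covered units, until every unit is covered."""
    uncovered: Set[LexicalDiphone] = set().union(*candidates.values())
    chosen: List[str] = []
    while uncovered:
        # Iterate names in sorted order so ties are broken deterministically.
        best = max(sorted(candidates),
                   key=lambda name: len(candidates[name] & uncovered))
        gain = candidates[best] & uncovered
        if not gain:
            break
        chosen.append(best)
        uncovered -= gain
    return chosen


# Toy data: the same phone string with two different stress patterns.
noun = lexical_diphones([("p", 1), ("er", 1), ("m", 0), ("ih", 0), ("t", 0)])
verb = lexical_diphones([("p", 0), ("er", 0), ("m", 1), ("ih", 1), ("t", 1)])
selection = greedy_cover({"noun": noun, "verb": verb})
```

In this toy case the two stress patterns share no lexical diphones, although a stress-blind diphone inventory would treat them as identical. That is the kind of distinction a phonemic diphone count of about 2000 would miss.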

Bibliographic reference. Lambert, Tanya / Breen, Andrew (2004): "A database design for a TTS synthesis system using lexical diphones", in INTERSPEECH-2004, 1381-1384.