INTERSPEECH 2004 - ICSLP
Database designs based on the premise that English has about 2000 diphones, as stated in many publications and online documents, are likely to yield a diphone database that fails to capture some important phonological phenomena of English. This paper proposes a TTS database built from diphones that include their syllabic stress; we term these units lexical diphones. A comprehensive lexical diphone feature set is generated using a stress-annotated dictionary together with continuous text and speech. Together, a method based on multiple set cover algorithms, applied to wordlists of specialized English usage, and a knowledge-based phonological approach produce a core text corpus of 540 sentences. An objective comparison with other databases shows that, relative to its size, our database has a higher concentration of lexical diphones; a subjective evaluation shows that listeners prefer speech containing more lexical than phonemic units.
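As an illustration of the selection idea only, the following minimal sketch applies the standard greedy approximation to set cover over stress-marked phone pairs. The abstract does not say which set cover algorithms were used, so the greedy strategy is an assumption. The function names, the ARPAbet-style stress digits, and the toy transcriptions are likewise hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Set, Tuple

# A unit is a pair of adjacent phones; vowels keep their stress digit
# (ARPAbet-style notation, assumed here purely for illustration).
Unit = Tuple[str, str]


def lexical_diphones(phones: List[str]) -> Set[Unit]:
    """Return the adjacent phone pairs of one item, stress marks preserved."""
    return set(zip(phones, phones[1:]))


def greedy_cover(candidates: Dict[str, Set[Unit]]) -> List[str]:
    """Greedy set cover: repeatedly pick the item adding the most uncovered units."""
    uncovered: Set[Unit] = set().union(*candidates.values()) if candidates else set()
    chosen: List[str] = []
    while uncovered:
        best = max(candidates, key=lambda name: len(candidates[name] & uncovered))
        gain = candidates[best] & uncovered
        if not gain:  # defensive stop; cannot happen while uncovered is non-empty
            break
        chosen.append(best)
        uncovered -= gain
    return chosen


# Toy transcriptions (illustrative, not from the paper's dictionary).
words = {
    "record_noun": ["R", "EH1", "K", "ER0", "D"],
    "record_verb": ["R", "IH0", "K", "AO1", "R", "D"],
    "cord": ["K", "AO1", "R", "D"],
}
candidates = {word: lexical_diphones(phones) for word, phones in words.items()}
selected = greedy_cover(candidates)
print(selected)
```

In this toy input the two stress variants of "record" share no units, because stress is part of each unit, so both are selected. "cord" is skipped because "record_verb" already covers all of its pairs.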
Bibliographic reference. Lambert, Tanya / Breen, Andrew (2004): "A database design for a TTS synthesis system using lexical diphones", In INTERSPEECH-2004, 1381-1384.