This paper describes an experimental AT&T concatenative synthesis system that uses unit selection and whose basic synthesis units are diphones. The synthesizer may draw on any of the data in a large database of utterances. Because the database generally contains multiple instances of each concatenative unit, candidates are selected dynamically at synthesis time, in a manner that is based on and extends the unit selection implemented in the CHATR synthesis system [1][4]. Selected units may be either phones or diphones, and they can be synthesized by a variety of methods, including PSOLA [5], HNM [3], and simple unit concatenation. The AT&T system, with CHATR unit selection, was implemented within the framework of the Festival Speech Synthesis System [2]. The voice database amounted to approximately one and one-half hours of speech and was constructed from read text taken from three sources. The first source was a portion of the 1989 Wall Street Journal material from the Penn Treebank Project, chosen so that the most frequent diphones were well represented. The second text, which was designed for diphone databases [6], ensured complete diphone coverage. The third source consisted of recorded prompts for telephone service applications. Formal subjective listening tests were conducted to compare speech quality across several options available in the AT&T synthesizer, including synthesis methods and choices of fundamental units. These tests showed that unit selection techniques can be successfully applied to diphone synthesis.
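To illustrate what dynamic selection among multiple candidate instances can look like, the following is a minimal sketch and not the AT&T or CHATR implementation. It assumes the general formulation of Hunt and Black [4], in which each candidate incurs a target cost and each pair of adjacent candidates incurs a join (concatenation) cost, and the sequence with the lowest total cost is found by dynamic programming. The `Unit` fields, the single pitch feature, and both cost functions are hypothetical stand-ins chosen only to keep the example small.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Target = Tuple[str, float]  # (diphone label, desired pitch in Hz); hypothetical spec


@dataclass(frozen=True)
class Unit:
    name: str     # diphone label, e.g. "h-e"
    pitch: float  # toy stand-in for real acoustic features
    source: int   # hypothetical id of the database utterance it came from


def select_units(
    targets: Sequence[Target],
    candidates: Sequence[Sequence[Unit]],
    target_cost: Callable[[Target, Unit], float],
    join_cost: Callable[[Unit, Unit], float],
) -> Tuple[List[Unit], float]:
    """Return the candidate sequence with minimum total target + join cost."""
    if not targets or len(targets) != len(candidates):
        raise ValueError("need one non-empty candidate list per target")
    if any(len(c) == 0 for c in candidates):
        raise ValueError("every target needs at least one candidate")

    # best[i][j]: cheapest cost of any path that ends in candidates[i][j]
    best = [[target_cost(targets[0], u) for u in candidates[0]]]
    back = [[-1] * len(candidates[0])]
    for i in range(1, len(targets)):
        row, ptr = [], []
        for unit in candidates[i]:
            costs = [
                best[i - 1][k] + join_cost(prev, unit)
                for k, prev in enumerate(candidates[i - 1])
            ]
            k_best = min(range(len(costs)), key=costs.__getitem__)
            row.append(costs[k_best] + target_cost(targets[i], unit))
            ptr.append(k_best)
        best.append(row)
        back.append(ptr)

    # Trace back from the cheapest final candidate.
    j = min(range(len(best[-1])), key=best[-1].__getitem__)
    total = best[-1][j]
    path: List[Unit] = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = back[i][j]
    path.reverse()
    return path, total


def pitch_target_cost(target: Target, unit: Unit) -> float:
    # Hypothetical: distance between desired and recorded pitch.
    return abs(target[1] - unit.pitch)


def pitch_join_cost(left: Unit, right: Unit) -> float:
    # Hypothetical: units from the same utterance join for free; otherwise
    # the pitch mismatch at the boundary is penalized.
    return 0.0 if left.source == right.source else 3.0 * abs(left.pitch - right.pitch)


if __name__ == "__main__":
    targets = [("h-e", 100.0), ("e-l", 100.0)]
    candidates = [
        [Unit("h-e", 100.0, 1), Unit("h-e", 110.0, 2)],
        [Unit("e-l", 140.0, 1), Unit("e-l", 105.0, 2)],
    ]
    path, total = select_units(targets, candidates, pitch_target_cost, pitch_join_cost)
    print([(u.name, u.source) for u in path], total)
```

In this toy data the first candidate for "h-e" matches its target exactly, yet the search prefers the second one because it joins without penalty to the best "e-l" candidate. This is the kind of trade-off that a purely local choice of units would miss.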
References
[1] A. Black. CHATR, Version 0.8, a generic speech synthesis. System documentation. ATR - Interpreting Telecommunications Laboratories, Kyoto, Japan, March 1996.
[2] A. Black and P. Taylor. The Festival Speech Synthesis System: system documentation. Technical Report HCRC/TR-83. Human Communications Research Centre, University of Edinburgh, Scotland, UK, January 1997.
[3] Y. Stylianou, T. Dutoit, and J. Schroeter. Diphones concatenation using a harmonic plus noise model of speech. Proc. EUROSPEECH, Sept. 1997.
[4] A. Hunt and A. Black. Unit selection in a concatenative speech synthesis system using a large speech database. ICASSP, 1:373-376, 1996.
[5] E. Moulines and F. Charpentier. Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones. Speech Communication, 9(5/6):453-467, 1990.
[6] A. Syrdal. Development of a female voice for a concatenative text-to-speech synthesis system. Current Topics in Acoust. Res., 1:169-181, 1994.
Cite as: Beutnagel, M., Conkie, A., Syrdal, A.K. (1998) Diphone synthesis using unit selection. Proc. 3rd ESCA/COCOSDA Workshop on Speech Synthesis (SSW 3), 185-190
@inproceedings{beutnagel98_ssw,
  author={Mark Beutnagel and Alistair Conkie and Ann K. Syrdal},
  title={{Diphone synthesis using unit selection}},
  year=1998,
  booktitle={Proc. 3rd ESCA/COCOSDA Workshop on Speech Synthesis (SSW 3)},
  pages={185--190}
}