Third International Conference on Spoken Language Processing (ICSLP 94)

Yokohama, Japan
September 18-22, 1994

Inducing Concatenative Units from Machine Readable Dictionaries and Corpora for Speech Synthesis

Judith L. Klavans (1,2), Evelyne Tzoukermann (2)

(1) Columbia University, Department of Computer Science, New York, NY, USA
(2) AT&T Bell Laboratories, Murray Hill, NJ, USA

The purpose of this research is to determine the best method for deciding on an optimal set of concatenative units for concatenative speech synthesis. Of the two main approaches to speech synthesis, segmental synthesis and rule-based synthesis, the former relies heavily on the successful choice of concatenative units. Segmental synthesis consists of concatenating segmental units (diphones, triphones, etc.); rule-based synthesis consists of computing control parameters from pre-established rules. Deciding on the set of diphones is quite straightforward: it suffices to take the phoneme inventory of a language and combine each phoneme with every other one. For example, taking the approximately 35 French phonemes, 1225 phonemic pairs (35 x 35) constitute the complete and exhaustive starting diphone inventory. On the other hand, deciding on the set of triphones, quadriphones, and larger units raises difficult questions about the nature of phonemes in a given language, such as: (1) stability vs. instability in a coarticulatory environment, (2) size of the overall inventory, and (3) frequency of a given unit in the language, in combination with factors (1) and (2).
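The diphone arithmetic above can be illustrated with a minimal sketch. It assumes placeholder symbols (P0 to P34) in place of the actual French phoneme set, which the abstract does not list, and the function name diphone_inventory is hypothetical. Note that the 35 x 35 count includes each phoneme paired with itself.

```python
from itertools import product


def diphone_inventory(phonemes):
    """Return every ordered phoneme pair, including self-pairs."""
    return list(product(phonemes, repeat=2))


# Assumption: 35 placeholder symbols stand in for the roughly 35 French phonemes.
phonemes = [f"P{i}" for i in range(35)]
inventory = diphone_inventory(phonemes)

print(len(inventory))  # 1225, i.e. 35 * 35
```

The same exhaustive construction for triphones would yield 35^3 = 42,875 candidates, which suggests why inventory size and unit frequency become deciding factors for larger units.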

We report on experiments with four different databases and compare the resources with respect to their n-gram frequency output. The first two databases consist of pronunciation field information from two dictionaries: the Encyclopedic Robert French dictionary [16], with 85,000 headwords, and the smaller Collins Gem [13], containing 15,000 words. For comparison, we use two text corpora: the Hansard (about 2.5 million words) and the smaller Tubach and Boe [31] corpus (80,000 words); both corpora were processed by a set of grapheme-to-phoneme rules [18]. A frequency extraction program was applied to all four resources to extract phonemic trigram frequencies; this serves as a basis for comparison between dictionary-derived and corpus-derived frequencies.
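The kind of trigram frequency extraction described above might be sketched as follows. This is not the authors' program, which the abstract does not describe. The function names, the toy transcriptions, and the choice to count trigrams within each word only (never across word boundaries) are all assumptions made for illustration.

```python
from collections import Counter


def trigram_counts(transcriptions):
    """Count phoneme trigrams inside each transcription.

    Assumption: trigrams never span two transcriptions (word-internal only).
    """
    counts = Counter()
    for phones in transcriptions:
        for i in range(len(phones) - 2):
            counts[tuple(phones[i:i + 3])] += 1
    return counts


def relative_frequencies(counts):
    """Convert raw trigram counts to relative frequencies."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {trigram: n / total for trigram, n in counts.items()}


# Toy, made-up data: a dictionary lists each headword once,
# whereas a corpus repeats words according to usage.
dictionary_like = [["p", "a", "r", "l", "e"], ["p", "a", "r", "i"]]
corpus_like = [["p", "a", "r", "l", "e"], ["p", "a", "r", "l", "e"], ["l", "e"]]

dict_counts = trigram_counts(dictionary_like)
corpus_counts = trigram_counts(corpus_like)

for name, counts in (("dictionary", dict_counts), ("corpus", corpus_counts)):
    print(name, relative_frequencies(counts))
```

Comparing relative rather than raw frequencies keeps resources of very different sizes, such as an 80,000-word corpus and a 2.5-million-word one, on a common scale.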

Bibliographic reference. Klavans, Judith L. / Tzoukermann, Evelyne (1994): "Inducing concatenative units from machine readable dictionaries and corpora for speech synthesis", in ICSLP-1994, 1755-1758.