Sixth European Conference on Speech Communication and Technology
Concatenative Text-to-Speech (TTS) systems such as those described by Hunt and Black can select at synthesis time from a very large number of recorded units. The selected units are chosen to minimize a combination of target and join costs for a given sentence. However, the join costs in particular can be quite expensive to compute, even when this computation has been optimized. Ideally, we would avoid this computation by precomputing and caching all possible join costs, but their number is prohibitive. Although the search space of possible joins is large, we have found that only a small fraction of them is selected in practice. By synthesizing a large quantity of text and logging the units actually selected, we were able to gather usage statistics and construct a practical and efficient cache of concatenation costs. Use of this cache dramatically decreased the runtime of the AT&T Next-Generation TTS system with negligible effect on speech quality. Experiments show that by caching 0.7% of the possible joins, 99% of the join cost computations can be avoided.
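The caching idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names build_join_cache and CachedJoinCost, the representation of units as integer IDs, the toy cost function, and the policy of keeping the most frequently logged joins up to a size limit are all assumptions made for illustration.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, Tuple

Join = Tuple[int, int]  # (left unit id, right unit id); integer ids are an assumption


def build_join_cache(
    logged_joins: Iterable[Join],
    join_cost: Callable[[int, int], float],
    max_entries: int,
) -> Dict[Join, float]:
    """Precompute costs for the joins seen most often in a synthesis log.

    Hypothetical policy: keep the max_entries most frequent joins.
    """
    counts = Counter(logged_joins)
    return {pair: join_cost(*pair) for pair, _ in counts.most_common(max_entries)}


class CachedJoinCost:
    """Look up a join cost in the cache; compute it on the fly on a miss."""

    def __init__(self, cache: Dict[Join, float], join_cost: Callable[[int, int], float]):
        self.cache = cache
        self.join_cost = join_cost
        self.hits = 0
        self.misses = 0

    def __call__(self, left: int, right: int) -> float:
        cost = self.cache.get((left, right))
        if cost is not None:
            self.hits += 1
            return cost
        self.misses += 1
        return self.join_cost(left, right)


def toy_join_cost(left: int, right: int) -> float:
    # Stand-in for the expensive acoustic join cost (hypothetical).
    return abs(left - right) * 0.5


# Joins logged while synthesizing a (tiny, made-up) training text.
log = [(1, 2), (1, 2), (3, 4), (1, 2), (3, 4), (5, 6)]
cache = build_join_cache(log, toy_join_cost, max_entries=2)
cost = CachedJoinCost(cache, toy_join_cost)

cached_value = cost(1, 2)    # frequent join: served from the cache
computed_value = cost(5, 6)  # rare join: falls back to direct computation
```

The fallback on a miss means the result is the same as without a cache; only the amount of computation changes, which is consistent with the abstract's report of a negligible effect on speech quality.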
Bibliographic reference. Beutnagel, Mark / Mohri, Mehryar / Riley, Michael (1999): "Rapid unit selection from a large speech corpus for concatenative speech synthesis", In EUROSPEECH'99, 607-610.