Concatenative "selection-based" synthesis from large databases has emerged as a viable framework for TTS waveform generation. Unit selection algorithms attempt to predict the appropriateness of a particular database speech segment using only linguistic features output by text analysis and prosody prediction components of a synthesizer. All of these algorithms have in common a training or "learning" phase in which parameters are trained to select appropriate waveform segments for a given feature vector input. One approach to this step is to partition available data into clusters that can be indexed by linguistic features available at runtime. This method relies critically on two important principles: discrimination of fine phonetic details using a perceptually-motivated distance measure in training and generalization to unseen cases in selection. In this paper, we describe efforts to systematically investigate and improve these parts of the process.
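As a rough illustration of the clustering idea described in the abstract, the following minimal sketch grows a binary tree over speech units by asking yes/no questions about linguistic features, then descends that tree at runtime to find a cluster for a target feature vector. It is not the authors' implementation: the names `Unit`, `Node`, `impurity`, `build_tree`, and `select_cluster`, the feature names, and the toy data are all hypothetical, and a plain squared Euclidean distance stands in for the perceptually motivated distance measure the abstract mentions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Unit:
    """Hypothetical database unit: linguistic features plus an acoustic vector."""
    features: dict   # e.g. {"left": "s", "stress": "1"} (illustrative names)
    acoustic: list   # stand-in acoustic parameters


@dataclass
class Node:
    units: list
    question: Optional[tuple] = None   # (feature_name, value), None at a leaf
    yes: Optional["Node"] = None
    no: Optional["Node"] = None


def impurity(units):
    """Sum of squared Euclidean distances to the cluster mean.

    Assumption: Euclidean distance is only a placeholder for a
    perceptually motivated distance measure.
    """
    if not units:
        return 0.0
    dim = len(units[0].acoustic)
    mean = [sum(u.acoustic[d] for u in units) / len(units) for d in range(dim)]
    return sum((u.acoustic[d] - mean[d]) ** 2 for u in units for d in range(dim))


def build_tree(units, min_size=2):
    """Greedily split on the feature question that most reduces impurity."""
    node = Node(units=units)
    total = impurity(units)
    best_gain, best = 1e-9, None
    questions = sorted({(k, v) for u in units for k, v in u.features.items()})
    for name, value in questions:
        yes = [u for u in units if u.features.get(name) == value]
        no = [u for u in units if u.features.get(name) != value]
        if len(yes) < min_size or len(no) < min_size:
            continue  # avoid clusters too small to be useful
        gain = total - (impurity(yes) + impurity(no))
        if gain > best_gain:
            best_gain, best = gain, ((name, value), yes, no)
    if best is not None:
        node.question, yes, no = best
        node.yes = build_tree(yes, min_size)
        node.no = build_tree(no, min_size)
    return node


def select_cluster(node, target_features):
    """Descend the tree using only linguistic features known at runtime."""
    while node.question is not None:
        name, value = node.question
        node = node.yes if target_features.get(name) == value else node.no
    return node.units


# Toy data: two acoustically distinct groups that differ in left context.
units = [
    Unit({"left": "s", "stress": "1"}, [1.0, 0.0]),
    Unit({"left": "s", "stress": "0"}, [1.1, 0.1]),
    Unit({"left": "t", "stress": "1"}, [5.0, 4.0]),
    Unit({"left": "t", "stress": "0"}, [5.2, 4.1]),
]
tree = build_tree(units)

# The stress value "2" never occurs in the toy data, yet the tree still
# maps the target to a cluster via the left-context question.
cluster = select_cluster(tree, {"left": "s", "stress": "2"})
```

In this sketch, discrimination corresponds to the choice of distance inside `impurity`, while generalization comes from the fact that any target feature vector, including one never seen in training, reaches some leaf.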
Cite as: Macon, M.W., Cronk, A.E., Wouters, J. (1998) Generalization and discrimination in tree-structured unit selection. Proc. 3rd ESCA/COCOSDA Workshop on Speech Synthesis (SSW 3), 195-200
@inproceedings{macon98_ssw,
  author={Michael W. Macon and Andrew E. Cronk and Johan Wouters},
  title={{Generalization and discrimination in tree-structured unit selection}},
  year=1998,
  booktitle={Proc. 3rd ESCA/COCOSDA Workshop on Speech Synthesis (SSW 3)},
  pages={195--200}
}