The lack of naturalness hampers the widespread application of speech synthesis. Increasing the size of the unit database in a concatenative speech synthesizer has been proposed as a method to increase the variety of units, thereby improving naturalness. However, expanding the unit database increases the computational cost of selecting the most appropriate unit and compounds the risk that a perceptually suboptimal unit is chosen. Clustering the unit database prior to synthesis is an effective method for reducing this cost and risk. In this study, a unit selection method based on tree-structured clustering of data is implemented and evaluated. This approach to tree construction differs from similar approaches used in both synthesis and recognition in that a "right-sized" tree is found automatically rather than using hand-tuned stopping criteria. The tree is grown to its maximum size, and its leaves are systematically recombined in order to determine the most suitable subtree. Trees are grown using the automatic stopping method and compared with those grown using thresholds. Cross validation shows that trees grown to their maximum size and systematically recombined produce fuller clusters with lower objective distortion measures than trees whose growth is arrested by a threshold. The study concludes with a discussion of how these results may affect the perceptual quality of a speech synthesizer.
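The grow-then-recombine idea can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on several assumptions that the abstract does not state: each unit is reduced to a tuple of yes/no context answers plus a single acoustic value, distortion is measured as squared error around the cluster mean, and the recombination step is approximated by simple reduced-error pruning against one held-out set rather than the paper's cross-validated procedure. All names (`Node`, `grow`, `prune`, and so on) are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple

# Assumed toy representation: (answers to yes/no context questions, acoustic value).
Unit = Tuple[Tuple[bool, ...], float]


@dataclass
class Node:
    mean: float                      # cluster centroid of the training units
    question: Optional[int] = None   # index of the context question, None for a leaf
    yes: Optional["Node"] = None
    no: Optional["Node"] = None

    def is_leaf(self) -> bool:
        return self.question is None


def sse(values: Sequence[float], center: Optional[float] = None) -> float:
    """Sum of squared errors around `center` (default: the mean of `values`)."""
    if not values:
        return 0.0
    c = sum(values) / len(values) if center is None else center
    return sum((v - c) ** 2 for v in values)


def split(units: Sequence[Unit], q: int) -> Tuple[List[Unit], List[Unit]]:
    """Partition units by their answer to question q."""
    return [u for u in units if u[0][q]], [u for u in units if not u[0][q]]


def grow(units: Sequence[Unit]) -> Node:
    """Grow to maximum size: split until a cluster is pure or unsplittable."""
    values = [v for _, v in units]
    node = Node(mean=sum(values) / len(values))
    best = None
    for q in range(len(units[0][0])):
        yes, no = split(units, q)
        if not yes or not no:
            continue  # this question does not separate the cluster
        cost = sse([v for _, v in yes]) + sse([v for _, v in no])
        if best is None or cost < best[0]:
            best = (cost, q, yes, no)
    if best is not None and sse(values) > 0.0:
        _, q, yes, no = best
        node.question, node.yes, node.no = q, grow(yes), grow(no)
    return node


def heldout_error(node: Node, units: Sequence[Unit]) -> float:
    """Distortion of held-out units routed down the subtree rooted at node."""
    if node.is_leaf():
        return sse([v for _, v in units], center=node.mean)
    yes, no = split(units, node.question)
    return heldout_error(node.yes, yes) + heldout_error(node.no, no)


def prune(node: Node, units: Sequence[Unit]) -> Node:
    """Bottom-up: merge sibling clusters when the split does not help held-out data."""
    if node.is_leaf():
        return node
    yes, no = split(units, node.question)
    prune(node.yes, yes)
    prune(node.no, no)
    as_leaf = sse([v for _, v in units], center=node.mean)
    if as_leaf <= heldout_error(node, units):
        node.question = node.yes = node.no = None  # recombine into one cluster
    return node


def count_leaves(node: Node) -> int:
    return 1 if node.is_leaf() else count_leaves(node.yes) + count_leaves(node.no)


# Toy data: question 0 carries the real structure; question 1 only fits noise.
train = [((True, True), 10.0), ((True, False), 10.5),
         ((False, True), 0.0), ((False, False), 0.4)]
heldout = [((True, True), 10.5), ((True, False), 10.0),
           ((False, True), 0.4), ((False, False), 0.0)]

full_tree = grow(train)
full_size = count_leaves(full_tree)
pruned_size = count_leaves(prune(full_tree, heldout))
```

On this toy data the full tree isolates every training unit, and pruning merges the leaves that were split only on noise, leaving larger clusters. A threshold-based alternative would instead stop `grow` early using a hand-chosen minimum cluster size or minimum distortion gain.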
Cite as: Cronk, A., Macon, M.W. (1998) Optimized stopping criteria for tree-based unit selection in concatenative synthesis. Proc. 5th International Conference on Spoken Language Processing (ICSLP 1998), paper 0680, doi: 10.21437/ICSLP.1998-23
@inproceedings{cronk98_icslp, author={Andrew Cronk and Michael W. Macon}, title={{Optimized stopping criteria for tree-based unit selection in concatenative synthesis}}, year=1998, booktitle={Proc. 5th International Conference on Spoken Language Processing (ICSLP 1998)}, pages={paper 0680}, doi={10.21437/ICSLP.1998-23} }