Unit selection synthesis has improved the quality of synthetic speech by making it possible to concatenate speech units from a large database to produce intelligible synthesis while preserving much of the naturalness of the original signal. Such synthesis is by no means perfect, however, and this paper describes work to achieve better joins between concatenated units. Results from a psychoacoustic experiment, together with acoustic parameters and phonetic factors, are analyzed and used in the statistical training of join costs so that audible discontinuities at concatenation boundaries can be minimized.
Cite as: Syrdal, A.K., Conkie, A.D. (2005) Perceptually-based data-driven join costs: comparing join types. Proc. Interspeech 2005, 2813-2816, doi: 10.21437/Interspeech.2005-620
@inproceedings{syrdal05_interspeech,
  author={Ann K. Syrdal and Alistair D. Conkie},
  title={{Perceptually-based data-driven join costs: comparing join types}},
  year=2005,
  booktitle={Proc. Interspeech 2005},
  pages={2813--2816},
  doi={10.21437/Interspeech.2005-620}
}