In concatenative synthesis, the join cost function can be related to the probability of a perceived discontinuity at the join. It is therefore important that the distance measures in the cost function correlate highly with the discontinuities that listeners perceive. This paper presents the results of a listening test on joins in two Norwegian long vowels, /A:/ and /e:/. Five spectral distance measures and the F0 difference are compared as predictors of the perceived discontinuities using Receiver Operating Characteristic (ROC) curves. In addition, a linear join cost function is optimized by means of stepwise linear regression.
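As a rough illustration of how an ROC curve can rank a distance measure as a predictor of perceived discontinuities, the sketch below sweeps a threshold over the distance values and records the hit rate against the false-alarm rate. It is only a minimal sketch under stated assumptions: the function names `roc_points` and `auc`, the binary "discontinuity heard" labels, the area-under-curve summary, and the toy numbers are all hypothetical and are not taken from the paper.

```python
def roc_points(distances, detected):
    """Return (false_positive_rate, true_positive_rate) pairs, one per threshold.

    A join is predicted 'discontinuous' when its distance is >= the threshold.
    `detected` holds 1 where listeners heard a discontinuity, else 0 (assumed labels).
    """
    positives = sum(detected)
    negatives = len(detected) - positives
    points = [(0.0, 0.0)]
    # Sweep thresholds from the largest distance down to the smallest.
    for threshold in sorted(set(distances), reverse=True):
        tp = sum(1 for d, y in zip(distances, detected) if d >= threshold and y)
        fp = sum(1 for d, y in zip(distances, detected) if d >= threshold and not y)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve, computed with the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Made-up toy data for illustration only (not results from the listening test).
distances = [0.1, 0.4, 0.35, 0.8, 0.7, 0.2]  # hypothetical distance at each join
detected = [0, 0, 1, 1, 1, 0]                # hypothetical listener judgments

print(auc(roc_points(distances, detected)))
```

Under this summary, a measure whose curve encloses a larger area separates audible from inaudible joins better; an area near 0.5 would mean the measure is no better than chance.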
Cite as: Bjørkan, I., Svendsen, T., Farner, S. (2005) Comparing spectral distance measures for join cost optimization in concatenative speech synthesis. Proc. Interspeech 2005, 2577-2580, doi: 10.21437/Interspeech.2005-799
@inproceedings{bjrkan05_interspeech,
  author={Ingmund Bjørkan and Torbjørn Svendsen and Snorre Farner},
  title={{Comparing spectral distance measures for join cost optimization in concatenative speech synthesis}},
  year=2005,
  booktitle={Proc. Interspeech 2005},
  pages={2577--2580},
  doi={10.21437/Interspeech.2005-799}
}