This paper describes the optimization of a cost function for segment selection in concatenative Text-to-Speech based on perceptual characteristics. We use the norm of the local costs of the individual segments as an integrated cost function for a segment sequence, so that it accounts for both the degradation of naturalness over the entire synthetic speech and local degradation. The cost function is optimized by adjusting not only the power coefficient of the norm but also the weights for the sub-costs, so that the integrated cost corresponds better to perceptual scores obtained in perceptual experiments. The results show that optimizing both the weights and the power coefficient improves the correspondence of the cost more than optimizing either one alone. However, they also show that the correspondence remains insufficient even after the integrated cost function is optimized.
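The integrated cost described above can be illustrated with a minimal sketch. It assumes that each local cost is a weighted sum of non-negative sub-costs and that the norm is normalized by the number of segments; neither detail is stated in the abstract. The function names `local_cost` and `integrated_cost` are hypothetical, and fitting the weights and the power coefficient to perceptual scores is not shown.

```python
from typing import Sequence


def local_cost(sub_costs: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted sum of the sub-costs of one segment (assumed form)."""
    if len(sub_costs) != len(weights):
        raise ValueError("sub_costs and weights must have the same length")
    return sum(w * c for w, c in zip(weights, sub_costs))


def integrated_cost(
    segments: Sequence[Sequence[float]],
    weights: Sequence[float],
    p: float,
) -> float:
    """Norm of the local costs over a segment sequence.

    p is the power coefficient. With p = 1 this is the average local cost;
    as p grows, the value approaches the largest local cost, so a single
    badly degraded segment dominates the result.
    Normalizing by the number of segments is an assumption of this sketch.
    """
    if p <= 0:
        raise ValueError("p must be positive")
    if not segments:
        raise ValueError("segments must not be empty")
    local = [local_cost(s, weights) for s in segments]
    mean_power = sum(lc ** p for lc in local) / len(local)
    return mean_power ** (1.0 / p)


if __name__ == "__main__":
    # Two segments with two sub-costs each; the values are illustrative only.
    example_segments = [[1.0, 1.0], [3.0, 3.0]]
    example_weights = [0.5, 0.5]
    for power in (1.0, 2.0, 50.0):
        print(power, integrated_cost(example_segments, example_weights, power))
```

In this sketch a small power coefficient emphasizes degradation averaged over the whole utterance, while a large one emphasizes the worst local degradation, which is the trade-off the power coefficient is meant to tune.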
Bibliographic reference. Toda, Tomoki / Kawai, Hisashi / Tsuzaki, Minoru (2003): "Optimizing integrated cost function for segment selection in concatenative speech synthesis based on perceptual evaluations", In EUROSPEECH-2003, 297-300.