Fifth ISCA ITRW on Speech Synthesis
June 14-16, 2004
Concatenative speech synthesis systems attempt to minimize audible discontinuities between successive concatenated units. In unit selection concatenative synthesis, a join cost is calculated to predict the extent of audible discontinuity introduced by concatenating two specific units. A study was conducted that used human perceptual data on the detectability of mid-vowel concatenation discontinuities to train and test several models for predicting perceptually based join costs. Both linear regression (LR) and classification and regression tree (CART) models were used, and each was trained on several different sets of predictor variables. All LR models and some CART models used strictly acoustic predictor variables, some CART models used acoustic plus categorical phonetic variables, and one CART model used strictly phonetic predictors. Tests showed that, when trained with the same acoustic predictor variables, LR and CART models achieved very similar results in predicting human detection rates. Euclidean cepstral distances were superior to vector quantization (VQ) cepstral distances as predictor variables. Categorical phonetic predictor variables in CART models greatly improved the accuracy with which concatenation discontinuities were predicted.
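As a rough illustration of the kind of acoustic predictor mentioned above, a Euclidean cepstral distance across a join might be sketched as follows. The function name, the example coefficient values, and the choice to compare a single frame on each side of the join are assumptions made for illustration only; the abstract does not describe the study's actual feature extraction or frame selection.

```python
import math


def euclidean_cepstral_distance(left_frame, right_frame):
    """Return the Euclidean distance between two cepstral vectors.

    Hypothetical helper: each argument is a sequence of cepstral
    coefficients for one frame, and both must have the same length.
    """
    if len(left_frame) != len(right_frame):
        raise ValueError("cepstral vectors must have the same dimension")
    squared_sum = sum((a - b) ** 2 for a, b in zip(left_frame, right_frame))
    return math.sqrt(squared_sum)


# Made-up coefficients (not data from the study): the last frame of the
# left unit and the first frame of the right unit at a candidate join.
left_unit_last_frame = [1.0, 2.0, 0.0]
right_unit_first_frame = [4.0, 6.0, 0.0]

distance = euclidean_cepstral_distance(left_unit_last_frame, right_unit_first_frame)
print(distance)
```

A distance of this kind would be only one input to a join cost. In the study summarized here, such acoustic distances served as predictor variables for LR and CART models trained against human detection rates.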
Bibliographic reference. Syrdal, Ann K. / Conkie, Alistair D. (2004): "Data-driven perceptually based join costs", In SSW5-2004, 49-54.