Fifth ISCA ITRW on Speech Synthesis

June 14-16, 2004
Pittsburgh, PA, USA

Data-Driven Perceptually Based Join Costs

Ann K. Syrdal, Alistair D. Conkie

AT&T Labs - Research, Florham Park, NJ, USA

Concatenative speech synthesis systems attempt to minimize audible discontinuities between successive concatenated units. In unit selection concatenative synthesis, a join cost is calculated to predict the extent of the audible discontinuity introduced by concatenating two specific units. A study was conducted that used human perceptual data on the detectability of mid-vowel concatenation discontinuities to train and test several models for predicting perceptually based join costs. Both linear regression (LR) and classification and regression tree (CART) models were used, each trained on several different sets of predictor variables. All LR models and some CART models used strictly acoustic predictor variables, some CART models used acoustic plus categorical phonetic variables, and one CART model used strictly phonetic predictors. Tests showed that, when trained with the same acoustic predictor variables, LR and CART models achieved very similar results in predicting human detection rates. Euclidean cepstral distances were superior to VQ cepstral distances as predictor variables. Categorical phonetic predictor variables in CART models greatly improved the accuracy of predicting concatenation discontinuities.
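
As a rough illustration of the kind of acoustic predictor discussed above, the following minimal sketch computes a Euclidean cepstral distance between the frames on either side of a join and fits a one-variable least-squares line from that distance to a detection rate. It is not the authors' implementation: the function names, the choice of a single predictor, and all numeric values are hypothetical and serve only to make the idea concrete.

```python
import math


def euclidean_cepstral_distance(left_frame, right_frame):
    """Euclidean distance between two cepstral vectors of equal length.

    left_frame and right_frame are assumed (hypothetically) to be the
    cepstral coefficients of the frames adjacent to the join.
    """
    if len(left_frame) != len(right_frame):
        raise ValueError("cepstral vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(left_frame, right_frame)))


def fit_simple_lr(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up toy values, for illustration only (not data from the paper).
toy_distances = [0.0, 1.0, 2.0]
toy_detection_rates = [0.1, 0.3, 0.5]
slope, intercept = fit_simple_lr(toy_distances, toy_detection_rates)

# Predicted join cost for a new pair of (hypothetical) cepstral frames.
d = euclidean_cepstral_distance([1.0, 0.5], [1.0, 1.5])
predicted_rate = slope * d + intercept
```

A CART model would replace the fitted line with a tree of threshold tests, which is what allows categorical phonetic predictors to be mixed with acoustic distances.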

Bibliographic reference. Syrdal, Ann K. / Conkie, Alistair D. (2004): "Data-driven perceptually based join costs", in SSW5-2004, 49-54.