7th International Conference on Spoken Language Processing
September 16-20, 2002
Most concatenative speech synthesizers employ both acoustic measures and phonetic features to predict the perceptual damage caused by concatenating two waveform segments, because no reliable acoustic measure has been found so far. This paper compares the predictive ability of these two kinds of predictor variables. First, we conduct a perceptual experiment to measure the naturalness degradation due to the signal discontinuity introduced by concatenating waveform segments. Second, we predict the naturalness degradation score from MFCC-derived acoustic measures and/or phonetic features using statistical models such as a multiple regression model. Based on an investigation of the multiple regression coefficients, we found that (1) the phonetic features are the more effective predictors and that (2) the acoustic measures provide no useful information beyond the phonetic features.
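As a rough illustration of the kind of comparison described in the abstract, the sketch below fits multiple regression models on an acoustic predictor alone, on phonetic predictors alone, and on both together. It is not the authors' implementation. Everything in it is an assumption made for illustration: the Euclidean MFCC distance at the join, the three phone classes, the dummy coding, and above all the synthetic scores, which are generated arbitrarily and say nothing about the paper's actual data or findings.

```python
import numpy as np

def mfcc_distance(left_frame, right_frame):
    """Euclidean distance between the MFCC vectors on either side of a join
    (one assumed acoustic measure; the paper may use others)."""
    diff = np.asarray(left_frame, dtype=float) - np.asarray(right_frame, dtype=float)
    return float(np.linalg.norm(diff))

def one_hot(labels, categories):
    """Dummy-code a categorical feature; the first category is the baseline."""
    return np.array([[1.0 if lab == c else 0.0 for c in categories[1:]]
                     for lab in labels])

def fit_ols(X, y):
    """Ordinary least squares with an intercept; returns (coefficients, R^2)."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    return coef, float(r2)

# Synthetic stand-in data (hypothetical; NOT the paper's listening-test results).
rng = np.random.default_rng(0)
n = 200
categories = ["vowel", "nasal", "fricative"]   # hypothetical phone classes
phones = rng.choice(categories, size=n)
left = rng.normal(size=(n, 12))                # MFCC frame before each join
right = rng.normal(size=(n, 12))               # MFCC frame after each join

acoustic = np.array([[mfcc_distance(l, r)] for l, r in zip(left, right)])
phonetic = one_hot(phones, categories)

# Arbitrary toy rule for the degradation score, plus noise.
scores = (0.3 * acoustic[:, 0]
          + phonetic @ np.array([-0.5, 0.8])
          + rng.normal(scale=0.5, size=n))

_, r2_acoustic = fit_ols(acoustic, scores)
_, r2_phonetic = fit_ols(phonetic, scores)
coef_both, r2_both = fit_ols(np.column_stack([acoustic, phonetic]), scores)

print("R^2 acoustic only:", round(r2_acoustic, 3))
print("R^2 phonetic only:", round(r2_phonetic, 3))
print("R^2 combined:     ", round(r2_both, 3))
```

With real listening-test scores in place of the toy `scores`, one would compare how much the combined model improves on the phonetic-only model. A negligible gain would correspond to the abstract's second finding. Note that the combined R^2 can never fall below either single-group R^2 in ordinary least squares, so the size of the gain matters rather than its sign.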
Bibliographic reference. Kawai, Hisashi / Tsuzaki, Minoru (2002): "Acoustic measures vs. phonetic features as predictors of audible discontinuity in concatenative speech synthesis", In ICSLP-2002, 2621-2624.