The measure of the goodness, or cost, of concatenating synthesis units plays an important role in concatenative speech synthesis. In this paper, we present a probabilistic approach to concatenation modeling in which the goodness of a concatenation is represented as the conditional probability of observing the spectral shape of a unit given the previous unit and the current phonetic context. This conditional probability is modeled by a conditional Gaussian density whose mean vector is a linear transform of the preceding spectral shape. Parameter tying based on phonetic decision trees is performed to achieve robust training that balances model complexity against the amount of training data available. The concatenation models were implemented in a corpus-based speech synthesizer trained on a CMU Arctic database, and the effectiveness of the proposed method was confirmed by a subjective listening test.
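As an illustration of the conditional Gaussian described above, the following is a minimal sketch of how a concatenation cost could be computed as a negative log-likelihood. It is not the authors' implementation. It assumes a diagonal covariance and an added bias term, and the names `concatenation_cost`, `A`, `b`, and `var` are hypothetical. In the tied model, one such parameter set would presumably be selected by the decision-tree leaf for the current phonetic context.

```python
import math


def concatenation_cost(prev, curr, A, b, var):
    """Negative log-likelihood of `curr` under N(A @ prev + b, diag(var)).

    prev, curr: spectral feature vectors (lists of floats) on either
                side of the join.
    A:          linear transform (list of rows) applied to `prev`.
    b:          bias vector (an assumption; the abstract only mentions
                a linear transform of the past spectral shape).
    var:        per-dimension variances (diagonal covariance is an
                assumption made to keep the sketch short).
    """
    cost = 0.0
    for i, x in enumerate(curr):
        # Predicted mean for dimension i, given the previous unit.
        mean_i = sum(a * p for a, p in zip(A[i], prev)) + b[i]
        # Per-dimension Gaussian negative log-density.
        cost += 0.5 * (math.log(2.0 * math.pi * var[i])
                       + (x - mean_i) ** 2 / var[i])
    return cost


# Toy parameters: identity transform, zero bias, unit variance.
A = [[1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0]
var = [1.0, 1.0]

prev_unit = [1.0, 2.0]
smooth_join = concatenation_cost(prev_unit, [1.0, 2.0], A, b, var)
abrupt_join = concatenation_cost(prev_unit, [3.0, 0.0], A, b, var)
```

With these toy parameters, a candidate unit whose spectral shape matches the prediction from the previous unit receives a lower cost than one that departs from it, which is the behavior a unit-selection search would exploit.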
Cite as: Sakai, S., Kawahara, T. (2006) Decision tree-based training of probabilistic concatenation models for corpus-based speech synthesis. Proc. Interspeech 2006, paper 1564-Wed2A3O.2, doi: 10.21437/Interspeech.2006-484
@inproceedings{sakai06_interspeech,
  author={Shinsuke Sakai and Tatsuya Kawahara},
  title={{Decision tree-based training of probabilistic concatenation models for corpus-based speech synthesis}},
  year=2006,
  booktitle={Proc. Interspeech 2006},
  pages={paper 1564-Wed2A3O.2},
  doi={10.21437/Interspeech.2006-484}
}