One problem in concatenative speech synthesis is how to incorporate prosodic factors into unit selection. Imposing a predicted prosodic contour as the target specification is error-prone and does not benefit from the natural variability contained in the database. This paper introduces a method that searches for the optimal unit sequence by maximizing a joint likelihood at both the segmental and prosodic levels. At the segmental level, the concatenation cost and target cost are reformulated in terms of conditional and a priori probabilities, which are combined with probabilistic models of fundamental frequency and duration at the syllable level and the phrase level. A generalized version of the Viterbi algorithm is used to take into account the long-term dependencies introduced by the prosodic models during the search for the optimal unit sequence. This method has been implemented in a unit selection synthesizer using an expressive speech database. A subjective evaluation shows an improvement in prosodic quality, although overall quality is only slightly enhanced.
Index Terms: speech synthesis, unit selection, prosody
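To make the probabilistic reformulation in the abstract concrete, the following is a minimal sketch of a standard Viterbi search over candidate units, in which the target cost is replaced by a log a priori probability and the concatenation cost by a log conditional probability. It is not the generalized Viterbi algorithm of the paper: the syllable-level and phrase-level prosodic models, and the long-term dependencies they introduce, are omitted. The function name `viterbi_select`, the callables `log_prior` and `log_transition`, and all toy probabilities are assumptions made for illustration only.

```python
import math


def viterbi_select(candidates, log_prior, log_transition):
    """Return the unit sequence maximizing the summed log-probabilities.

    candidates[t] is the list of candidate unit ids for target position t.
    log_prior(t, u) stands in for the target cost (hypothetical interface).
    log_transition(p, u) stands in for the concatenation cost (hypothetical).
    """
    # best[t][u] = (best log-score of any path ending in u, back-pointer)
    best = [{u: (log_prior(0, u), None) for u in candidates[0]}]
    for t in range(1, len(candidates)):
        layer = {}
        for u in candidates[t]:
            score, prev = max(
                (best[t - 1][p][0] + log_transition(p, u), p)
                for p in candidates[t - 1]
            )
            layer[u] = (score + log_prior(t, u), prev)
        best.append(layer)

    # Backtrack from the best-scoring final unit.
    last = max(best[-1], key=lambda u: best[-1][u][0])
    total = best[-1][last][0]
    path = [last]
    for t in range(len(candidates) - 1, 0, -1):
        path.append(best[t][path[-1]][1])
    path.reverse()
    return path, total


# Toy data (invented): two target positions, two candidate units each.
candidates = [["a1", "a2"], ["b1", "b2"]]
prior = {"a1": 0.6, "a2": 0.4, "b1": 0.5, "b2": 0.5}
trans = {("a1", "b1"): 0.1, ("a1", "b2"): 0.2,
         ("a2", "b1"): 0.9, ("a2", "b2"): 0.1}

path, total = viterbi_select(
    candidates,
    lambda t, u: math.log(prior[u]),
    lambda p, u: math.log(trans[(p, u)]),
)
print(path, math.exp(total))
```

In this toy example the unit with the highest a priori probability at the first position (`a1`) is not selected, because it joins poorly with every candidate at the second position. Adding prosodic terms would presumably require extending the search state beyond a single preceding unit, which is the role the abstract assigns to the generalized Viterbi algorithm.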
Cite as: Veaux, C., Lanchantin, P., Rodet, X. (2010) Joint prosodic and segmental unit selection for expressive speech synthesis. Proc. 7th ISCA Workshop on Speech Synthesis (SSW 7), 323-327
@inproceedings{veaux10_ssw, author={Christophe Veaux and Pierre Lanchantin and Xavier Rodet}, title={{Joint prosodic and segmental unit selection for expressive speech synthesis}}, year=2010, booktitle={Proc. 7th ISCA Workshop on Speech Synthesis (SSW 7)}, pages={323--327} }