Unit selection synthesis has proven capable of producing high-quality, natural-sounding synthetic speech when built from large databases of well-recorded, well-labeled speech. However, the time and expertise needed to build such voices are still too great, and too specialized, to allow individual voices to be built for everyone. The quality of unit selection synthesis is directly related to the quality and size of the database used. As we require our speech synthesizers to offer more variation, style and emotion, unit selection synthesis will require much larger databases. As an alternative, we have more recently started looking at parametric models for speech synthesis, which are still trained from databases of natural speech but are more robust to errors and allow for better modeling of variation. This paper presents the CLUSTERGEN synthesizer, which is implemented within the Festival/FestVox voice building environment. As well as the basic technique, three methods of modeling dynamics in the signal are presented and compared: a simple point model, a basic trajectory model, and a trajectory model with overlap and add.
Cite as: Black, A.W. (2006) CLUSTERGEN: a statistical parametric synthesizer using trajectory modeling. Proc. Interspeech 2006, paper 1394-Wed2A3O.6, doi: 10.21437/Interspeech.2006-488
@inproceedings{black06_interspeech,
  author={Alan W. Black},
  title={{CLUSTERGEN: a statistical parametric synthesizer using trajectory modeling}},
  year=2006,
  booktitle={Proc. Interspeech 2006},
  pages={paper 1394-Wed2A3O.6},
  doi={10.21437/Interspeech.2006-488}
}