Fifth ISCA ITRW on Speech Synthesis

June 14-16, 2004
Pittsburgh, PA, USA

Forced Alignment for Speech Synthesis Databases using Duration and Prosodic Phrase Breaks

Arthur R. Toth

Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA, USA

Alignment of text to recorded audio is limited because standard techniques do not handle very long utterances well. This work presents a model for segmenting long recordings into smaller utterances. Our approach differs from typical forced alignment techniques in that prosodic phrase break locations are estimated first, and words are then placed around the breaks based on the length and break probability of each word. This last step is performed by an HMM whose parameters are determined in a novel way. On a well-publicized database [1], classification of word boundaries achieved 65.7% accuracy on actual breaks and 92.2% accuracy overall.
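The word-placement step can be illustrated with a small sketch. This is not the HMM used in this work, whose parameter estimation is not described in the abstract. It is a simplified dynamic program under assumed inputs: a predicted duration per word, a probability that a phrase break follows each word, and the measured durations of the audio stretches between detected breaks. The function name place_words, the Gaussian duration penalty, and the sigma value are all hypothetical choices made for illustration.

```python
import math

def place_words(word_durs, break_probs, segment_durs, sigma=0.3):
    """Assign words to the stretches of audio between detected breaks.

    Hypothetical sketch, not the paper's HMM. Assumes:
      word_durs[i]    predicted duration of word i (seconds)
      break_probs[i]  probability of a phrase break after word i, in (0, 1)
      segment_durs[s] measured duration of audio segment s (seconds)
    Returns the word counts at which the breaks fall.
    """
    n, k = len(word_durs), len(segment_durs)
    if not 1 <= k <= n:
        raise ValueError("need between 1 and len(word_durs) segments")

    # Prefix sums of durations and of log(1 - p) for "no break" words.
    dur_prefix, nobreak_prefix = [0.0], [0.0]
    for d, p in zip(word_durs, break_probs):
        dur_prefix.append(dur_prefix[-1] + d)
        nobreak_prefix.append(nobreak_prefix[-1] + math.log(1.0 - p))

    neg_inf = float("-inf")
    # best[s][j]: best log score placing the first j words in s segments.
    best = [[neg_inf] * (n + 1) for _ in range(k + 1)]
    back = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0

    for s in range(1, k + 1):
        for j in range(s, n + 1):
            for i in range(s - 1, j):
                if best[s - 1][i] == neg_inf:
                    continue
                # Duration mismatch between words i..j-1 and segment s-1.
                err = (dur_prefix[j] - dur_prefix[i]) - segment_durs[s - 1]
                score = best[s - 1][i] - err * err / (2.0 * sigma * sigma)
                # Words i..j-2 are not followed by a break.
                score += nobreak_prefix[j - 1] - nobreak_prefix[i]
                # Word j-1 is followed by a break unless it ends the text.
                if j < n:
                    score += math.log(break_probs[j - 1])
                if score > best[s][j]:
                    best[s][j] = score
                    back[s][j] = i

    # Trace back the segment end points from the final word.
    ends, j = [], n
    for s in range(k, 0, -1):
        ends.append(j)
        j = back[s][j]
    ends.reverse()
    return ends[:-1]

# Made-up example: six words, two audio segments around one detected break.
durs = [0.3, 0.4, 0.5, 0.2, 0.6, 0.3]
probs = [0.1, 0.1, 0.8, 0.1, 0.1, 0.5]
print(place_words(durs, probs, [1.2, 1.1]))
```

With these made-up values the sketch places the break after the third word, where the summed word durations match both segments and the break probability is highest.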


  1. M. Ostendorf, P. Price, and S. Shattuck-Hufnagel, "The Boston University Radio News Corpus," Tech. Rep. ECS-95-001, Electrical, Computer and Systems Engineering Department, Boston University, Boston, MA, 1995.

Bibliographic reference.  Toth, Arthur R. (2004): "Forced alignment for speech synthesis databases using duration and prosodic phrase breaks", In SSW5-2004, 225-226.