Fifth ISCA ITRW on Speech Synthesis
June 14-16, 2004
Standard techniques for aligning text to recorded audio do not handle very long utterances well. This work presents a model for segmenting long recordings into smaller utterances. Our approach differs from typical forced alignment techniques in that prosodic phrase break locations are estimated first, and words are then placed around the breaks based on the length and break probability of each word. This last step is performed by an HMM whose parameters are determined in a novel way. Classifying word boundaries on a well-publicized database gave 65.7% accuracy on actual breaks and 92.2% accuracy overall.
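The word-placement step can be illustrated with a small dynamic-programming sketch. It is not the HMM from the paper, whose parameterization the abstract does not give. It is a simplified stand-in that assumes each word has an estimated duration and a break probability strictly between 0 and 1, that break times have already been detected in the audio, and that duration mismatch is penalized with a Gaussian of hypothetical width sigma. The function name place_breaks and all parameter names are illustrative only.

```python
import math

def place_breaks(durations, break_probs, break_times, sigma=0.5):
    """Return 0-based indices of the words after which each detected break falls.

    durations   -- estimated duration of each word in seconds (assumed input)
    break_probs -- probability of a phrase break after each word, in (0, 1)
    break_times -- detected break positions in seconds of speech, ascending
    sigma       -- assumed standard deviation of the duration mismatch
    """
    n, k = len(durations), len(break_times)
    if k > n:
        raise ValueError("more breaks than words")

    # cum[j] = total estimated duration of the first j words.
    cum = [0.0]
    for d in durations:
        cum.append(cum[-1] + d)
    # log_no[j] = summed log-probability that none of the first j words has a break.
    log_no = [0.0]
    for p in break_probs:
        log_no.append(log_no[-1] + math.log(1.0 - p))

    neg_inf = float("-inf")
    # score[b][j] = best log-score with b breaks placed, the last one after j words.
    score = [[neg_inf] * (n + 1) for _ in range(k + 1)]
    back = [[0] * (n + 1) for _ in range(k + 1)]
    score[0][0] = 0.0

    for b in range(1, k + 1):
        previous_time = break_times[b - 2] if b > 1 else 0.0
        observed = break_times[b - 1] - previous_time
        for j in range(b, n + 1):
            for i in range(b - 1, j):
                if score[b - 1][i] == neg_inf:
                    continue
                expected = cum[j] - cum[i]
                # Gaussian penalty for mismatch between observed and expected length.
                duration_term = -0.5 * ((observed - expected) / sigma) ** 2
                # Words strictly inside the segment carry no break; the last one does.
                prior_term = (log_no[j - 1] - log_no[i]) + math.log(break_probs[j - 1])
                candidate = score[b - 1][i] + duration_term + prior_term
                if candidate > score[b][j]:
                    score[b][j] = candidate
                    back[b][j] = i

    # Pick the best position for the final break, then trace back.
    j = max(range(k, n + 1), key=lambda idx: score[k][idx])
    placed = []
    for b in range(k, 0, -1):
        placed.append(j - 1)
        j = back[b][j]
    return placed[::-1]
```

With durations [0.3, 0.4, 0.3, 0.5, 0.5], break probabilities [0.1, 0.2, 0.6, 0.1, 0.3], and detected breaks at 1.0 and 2.0 seconds, this sketch places the breaks after the words at indices 2 and 4, where the cumulative durations match the detected times and the break probabilities are relatively high.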
Bibliographic reference. Toth, Arthur R. (2004): "Forced alignment for speech synthesis databases using duration and prosodic phrase breaks", In SSW5-2004, 225-226.