7th International Conference on Spoken Language Processing
September 16-20, 2002
High-accuracy phonetic segmentation is critical for achieving good quality in concatenative text-to-speech synthesis. Due to the shortcomings of current automated techniques based on HMM alignment or Dynamic Time Warping (DTW), manual verification and labeling are often required. In this paper we present a novel technique for automatic placement of phoneme boundaries in a speech waveform using explicit statistical models for phoneme boundaries. This substantially reduces the labor- and time-intensive manual labeling required to build a new voice. The phonetic segmentation is carried out in two steps, similar to the way a human expert would label the waveform. In the first step, an initial estimate of the labeling is generated using an HMM-based phoneme recognizer. The second step refines each boundary placement by searching a region near the estimated boundary for the best match with predefined boundary models generated from existing labeled speech corpora. The proposed method can be used in conjunction with any of the segmentation schemes used in practice. In our performance evaluations, the system produces time marks that are 30-40% better than those of the schemes currently in use.
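The second, refinement step can be illustrated with a minimal sketch. The abstract does not specify the form of the boundary models, so everything below is an assumption made for illustration: the boundary model is taken to be a diagonal-covariance Gaussian over a short window of stacked feature frames, and the names `refine_boundary`, `context_vector`, `search_radius`, and `half_width` are hypothetical, not taken from the paper.

```python
import math


def log_gaussian(x, mean, var):
    """Log density of a diagonal-covariance Gaussian evaluated at vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def context_vector(frames, t, half_width):
    """Stack the frames in [t - half_width, t + half_width) into one vector."""
    stacked = []
    for frame in frames[t - half_width:t + half_width]:
        stacked.extend(frame)
    return stacked


def refine_boundary(frames, initial, model, search_radius, half_width):
    """Return the frame index near `initial` that best matches `model`.

    frames:  list of per-frame feature vectors (lists of floats)
    initial: boundary estimate from the first pass (e.g. an HMM alignment)
    model:   (mean, var) of an assumed Gaussian boundary model, trained
             elsewhere on labeled data; this form is an assumption
    """
    mean, var = model
    # Restrict the search to candidates with a full context window.
    lo = max(half_width, initial - search_radius)
    hi = min(len(frames) - half_width, initial + search_radius)
    best_t, best_score = initial, -math.inf
    for t in range(lo, hi + 1):
        score = log_gaussian(context_vector(frames, t, half_width), mean, var)
        if score > best_score:
            best_t, best_score = t, score
    return best_t


# Toy data: a one-dimensional feature that steps from 0 to 1 at frame 10.
frames = [[0.0]] * 10 + [[1.0]] * 10
model = ([0.0, 0.0, 1.0, 1.0], [0.1, 0.1, 0.1, 0.1])
refined = refine_boundary(frames, initial=7, model=model,
                          search_radius=5, half_width=2)
```

In this toy case the initial estimate is three frames early, and the search moves it to the frame where the stacked context best fits the assumed model. A real system would presumably use acoustic features and boundary models specific to the phoneme pair, which this sketch does not attempt.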
Bibliographic reference. Sethy, Abhinav / Narayanan, Shrikanth S. (2002): "Refined speech segmentation for concatenative speech synthesis", In ICSLP-2002, 149-152.