An automatic system for segmenting speech signals used for the training of statistical prosody models is presented. Starting from a canonical transcription, the system simultaneously delivers an accurate phonetic segmentation and the matched phonetic transcription indicating pronunciation variants.
Although the system is HMM-based, it uses only the speech signals of the prosody database, which typically consists of a few hundred sentences with a total duration of some 30 minutes. Initial phone HMMs are generated with flat-start training using the canonical transcriptions of the sentences. Then an iterative Viterbi search for the best-matching pronunciation variants and HMM retraining are applied alternately until convergence is attained.
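The loop described above can be illustrated with a minimal toy sketch. It is not the authors' implementation: it assumes one-dimensional features, a single Gaussian state per phone, a flat start approximated by uniform segmentation, Viterbi-style retraining from the aligned frames, and pronunciation variants limited to single-phone deletions. All names (`align`, `estimate`, `deletion_variants`, `label_and_segment`, `VAR_FLOOR`) are hypothetical.

```python
import math

VAR_FLOOR = 0.1  # assumed variance floor that keeps the toy Gaussians stable

def log_gauss(x, mean, var):
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def estimate(utts_frames, utts_labels, old=None):
    """Re-estimate one Gaussian (mean, variance) per phone from labelled frames."""
    models, pooled = dict(old or {}), {}
    for frames, labels in zip(utts_frames, utts_labels):
        for x, phone in zip(frames, labels):
            pooled.setdefault(phone, []).append(x)
    for phone, xs in pooled.items():
        mean = sum(xs) / len(xs)
        var = sum((x - mean) ** 2 for x in xs) / len(xs)
        models[phone] = (mean, max(var, VAR_FLOOR))
    return models

def align(frames, phones, models):
    """Viterbi forced alignment to a left-to-right phone sequence.
    Returns (log-likelihood, phone label per frame)."""
    T, N, NEG = len(frames), len(phones), float("-inf")
    score = [[NEG] * N for _ in range(T)]
    back = [[0] * N for _ in range(T)]
    score[0][0] = log_gauss(frames[0], *models[phones[0]])
    for t in range(1, T):
        for j in range(N):
            stay = score[t - 1][j]
            move = score[t - 1][j - 1] if j > 0 else NEG
            best, back[t][j] = (stay, j) if stay >= move else (move, j - 1)
            if best > NEG:
                score[t][j] = best + log_gauss(frames[t], *models[phones[j]])
    j, path = N - 1, [N - 1]
    for t in range(T - 1, 0, -1):  # backtrace from the final phone
        j = back[t][j]
        path.append(j)
    path.reverse()
    return score[T - 1][N - 1], [phones[k] for k in path]

def deletion_variants(canon):
    """Assumed variant set: the canonical form plus every single-phone deletion."""
    extra = [canon[:i] + canon[i + 1:] for i in range(len(canon))] if len(canon) > 1 else []
    return [canon] + extra

def label_and_segment(utts_frames, canonical, max_iter=20):
    # Flat start (simplified): spread the canonical phones uniformly over the frames.
    labels = [[c[t * len(c) // len(f)] for t in range(len(f))]
              for f, c in zip(utts_frames, canonical)]
    models = estimate(utts_frames, labels)
    transcriptions = list(canonical)
    for _ in range(max_iter):
        # Viterbi search over the variants of each sentence; keep the best-scoring one.
        picks = [max((align(f, v, models) + (v,) for v in deletion_variants(c)),
                     key=lambda r: r[0])
                 for f, c in zip(utts_frames, canonical)]
        new_labels = [p[1] for p in picks]
        transcriptions = [p[2] for p in picks]
        if new_labels == labels:  # segmentation unchanged: converged
            break
        labels = new_labels
        models = estimate(utts_frames, labels, models)  # retrain on the new alignment
    return transcriptions, labels

# Toy data: the second sentence is spoken without the canonical "b".
frames = [[0.1, -0.1, 0.0, 5.1, 4.9, 5.0, 10.1, 9.9, 10.0],
          [0.0, 0.2, -0.2, 10.0, 10.2, 9.8]]
canonical = [["a", "b", "c"], ["a", "b", "c"]]
transcriptions, labels = label_and_segment(frames, canonical)
```

In this toy run the second sentence is matched to the variant without "b", and the per-frame labels give the phone boundaries. A real system would use multi-state HMMs over spectral feature vectors and a richer variant set than deletions alone.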
Cite as: Romsdorfer, H., Pfister, B. (2005) Phonetic labeling and segmentation of mixed-lingual prosody databases. Proc. Interspeech 2005, 3281-3284, doi: 10.21437/Interspeech.2005-572
@inproceedings{romsdorfer05_interspeech,
  author={Harald Romsdorfer and Beat Pfister},
  title={{Phonetic labeling and segmentation of mixed-lingual prosody databases}},
  year=2005,
  booktitle={Proc. Interspeech 2005},
  pages={3281--3284},
  doi={10.21437/Interspeech.2005-572}
}