Interspeech'2005 - Eurospeech
This paper presents an automatic system for segmenting the speech signals used to train statistical prosody models. Starting from a canonical transcription, the system simultaneously delivers an accurate phonetic segmentation and the matching phonetic transcription, which indicates the pronunciation variants actually realized.
Although the system is HMM-based, it uses only the speech signals of the prosody database, which typically consists of a few hundred sentences totaling about 30 minutes of speech. Initial phone HMMs are generated by flat-start training on the canonical transcriptions of the sentences. Then a Viterbi search for the best-matching pronunciation variants and HMM retraining are applied iteratively until convergence.
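The iterative procedure can be illustrated with a minimal sketch. It is not the authors' implementation and rests on several simplifying assumptions: each phone is modeled by a single one-dimensional Gaussian rather than a multi-state HMM, transition probabilities are ignored, flat start is approximated by uniform segmentation of the canonical transcription, and convergence is taken to mean that the selected variants and frame alignments no longer change. All function names and the toy data are hypothetical.

```python
import math
from collections import defaultdict

def log_gauss(x, mean, var):
    """Log density of a 1-D Gaussian; stands in for a real HMM emission model."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def estimate(frames_by_phone, fallback, var_floor=0.1):
    """Re-estimate one (mean, variance) pair per phone from its assigned frames."""
    models = dict(fallback)  # phones without frames keep their previous model
    for phone, xs in frames_by_phone.items():
        mean = sum(xs) / len(xs)
        var = sum((x - mean) ** 2 for x in xs) / len(xs)
        models[phone] = (mean, max(var, var_floor))
    return models

def viterbi_align(obs, phones, models):
    """Forced alignment of frames to a left-to-right phone chain.
    Returns (log score, phone index per frame); each phone needs >= 1 frame."""
    T, N, NEG = len(obs), len(phones), float("-inf")
    score = [[NEG] * N for _ in range(T)]
    back = [[0] * N for _ in range(T)]
    score[0][0] = log_gauss(obs[0], *models[phones[0]])
    for t in range(1, T):
        for j in range(N):
            stay = score[t - 1][j]
            move = score[t - 1][j - 1] if j > 0 else NEG
            best, prev = (stay, j) if stay >= move else (move, j - 1)
            if best > NEG:
                score[t][j] = best + log_gauss(obs[t], *models[phones[j]])
                back[t][j] = prev
    path = [N - 1]
    for t in range(T - 1, 0, -1):  # backtrace from the final phone
        path.append(back[t][path[-1]])
    path.reverse()
    return score[T - 1][N - 1], path

def collect(obs, phones, path, frames_by_phone):
    """Accumulate each frame under the phone it is aligned to."""
    for x, j in zip(obs, path):
        frames_by_phone[phones[j]].append(x)

def label_and_segment(utterances, max_iter=20):
    """utterances: list of (frames, variants); variants[0] is the canonical one."""
    all_frames = [x for obs, _ in utterances for x in obs]
    all_phones = {p for _, variants in utterances for v in variants for p in v}
    global_model = estimate({"_": all_frames}, {})["_"]
    # Assumed flat start: uniform segmentation of the canonical transcription.
    frames_by_phone = defaultdict(list)
    for obs, variants in utterances:
        canon = variants[0]
        path = [t * len(canon) // len(obs) for t in range(len(obs))]
        collect(obs, canon, path, frames_by_phone)
    models = estimate(frames_by_phone, {p: global_model for p in all_phones})

    previous = None
    for _ in range(max_iter):
        result, frames_by_phone = [], defaultdict(list)
        for obs, variants in utterances:
            # Viterbi search over all pronunciation variants of this sentence.
            _, path, phones = max(
                (viterbi_align(obs, v, models) + (v,) for v in variants),
                key=lambda r: r[0])
            collect(obs, phones, path, frames_by_phone)
            result.append((phones, path))
        if result == previous:  # labels and boundaries are stable
            break
        models = estimate(frames_by_phone, models)  # retraining step
        previous = result
    return result

# Toy data: the second sentence is "spoken" without the phone "b".
utterances = [
    ([0.1, -0.2, 0.0, 5.1, 4.9, 5.0, 10.2, 9.8, 10.0], [("a", "b", "c")]),
    ([0.0, 0.2, -0.1, 0.1, 9.9, 10.1, 10.0, 10.2], [("a", "b", "c"), ("a", "c")]),
]
for phones, path in label_and_segment(utterances):
    print(phones, path)
```

On this toy input the second sentence is labeled with the reduced variant `a c` rather than the canonical `a b c`, and the per-frame phone indices give the segment boundaries directly. A real system would replace the Gaussian scores with full HMM likelihoods over acoustic feature vectors.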
Bibliographic reference. Romsdorfer, Harald / Pfister, Beat (2005): "Phonetic labeling and segmentation of mixed-lingual prosody databases", In INTERSPEECH-2005, 3281-3284.