The most popular method for automatic segmentation is embedded reestimation of monophone HMMs after flat-start initialization, followed by forced alignment. This method may not yield accurate boundaries. To address this issue, group delay based processing of short-time energy (STE) is performed on the speech signal to obtain syllable boundaries. These syllable boundaries are accurate, but they include a number of spurious insertions because the text transcription is not used during segmentation. The group delay boundaries that lie in the vicinity of the HMM syllable boundaries are taken as the correct boundaries and used to reestimate the monophone HMMs, with reestimation restricted to the span between syllable boundaries rather than the whole utterance. The reestimated boundaries are again compared with the group delay boundaries and corrected. Essentially, signal processing for detecting boundaries and statistical segmentation for acoustic modelling work in tandem to obtain accurate segmentation at both the phoneme and syllable levels. Taking phones and syllables as basic units, HMM-based speech synthesis systems (HTS) are built with the proposed segmentation method. Listening tests indicate an improvement in the quality of synthesis.
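The boundary correction step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `snap_boundaries`, the tolerance `tol`, and the rule "take the nearest group delay boundary within `tol` seconds" are assumptions made for illustration, since the abstract does not define how "vicinity" is measured.

```python
from bisect import bisect_left


def snap_boundaries(hmm_bounds, gd_bounds, tol=0.03):
    """Move each HMM syllable boundary to the nearest group delay boundary.

    hmm_bounds: boundary times (seconds) from forced alignment.
    gd_bounds:  boundary times (seconds) from group delay processing,
                possibly containing spurious insertions.
    tol:        hypothetical vicinity threshold in seconds (an assumption).
    """
    gd_sorted = sorted(gd_bounds)
    corrected = []
    for t in hmm_bounds:
        # Only the neighbours on either side of t can be the nearest one.
        i = bisect_left(gd_sorted, t)
        candidates = gd_sorted[max(i - 1, 0):i + 1]
        nearest = min(candidates, key=lambda g: abs(g - t), default=None)
        if nearest is not None and abs(nearest - t) <= tol:
            corrected.append(nearest)   # trust the signal-processing boundary
        else:
            corrected.append(t)         # no nearby candidate: keep the HMM boundary
    return corrected


hmm = [0.30, 0.62, 1.10]
gd = [0.10, 0.31, 0.45, 0.60, 0.95]   # 0.10, 0.45 and 0.95 are unmatched here
print(snap_boundaries(hmm, gd))
```

Group delay boundaries that are not close to any HMM boundary are simply never selected, which is one way the transcription-driven alignment can filter out spurious insertions. In the method described, the corrected boundaries would then constrain a further round of HMM reestimation, and the comparison would be repeated.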
Bibliographic reference. Shanmugam, S. Aswin / Murthy, Hema (2014): "A hybrid approach to segmentation of speech using group delay processing and HMM based embedded reestimation", In INTERSPEECH-2014, 1648-1652.