SAPA-SCALE Conference 2012
Portland, OR, USA
In HMM-based speech synthesis, accurate duration modelling is important because duration strongly affects perceptual qualities of speech such as rhythm. For this reason, a hidden semi-Markov model (HSMM) is commonly used to model duration explicitly, instead of relying on the implicit state duration model that an HMM defines through its transition probabilities. The cost of using an HSMM to improve duration modelling is increased computational complexity, both in the parameter re-estimation algorithms and in duration clustering using contextual features. This paper proposes an alternative explicit duration modelling approach to the HSMM: a hybrid of an HMM and a multilayer perceptron (MLP). In the first stage, the HMM is used for state-level phone alignment in order to obtain the HMM state durations for each phone. In the second stage, duration is modelled with an MLP whose inputs are contextual features and whose output units are the state durations. Both objective and perceptual evaluations showed that, compared with the HSMM, the proposed method improved duration prediction and the perceptual quality of synthetic speech.
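The two-stage idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the five-state topology, the single tanh hidden layer, the squared-error training loop, and all function names (`state_durations`, `init_mlp`, `forward`, `train`) are assumptions made for illustration, and the state-level alignment is taken as given rather than produced by an actual HMM.

```python
import numpy as np

def state_durations(alignment, n_states=5):
    """Stage 1 (assumed input): turn a frame-level state alignment for one
    phone, e.g. [0, 0, 1, 2, 2], into per-state frame counts."""
    counts = np.bincount(np.asarray(alignment), minlength=n_states)
    return counts.astype(float)

def init_mlp(n_in, n_hidden, n_out, seed=0):
    """Small random weights for a one-hidden-layer MLP (hypothetical sizes)."""
    rng = np.random.default_rng(seed)
    return {
        "W1": rng.normal(0.0, 0.1, (n_in, n_hidden)),
        "b1": np.zeros(n_hidden),
        "W2": rng.normal(0.0, 0.1, (n_hidden, n_out)),
        "b2": np.zeros(n_out),
    }

def forward(params, X):
    """Stage 2: map contextual feature vectors to predicted state durations.
    Returns the hidden activations and the linear outputs."""
    H = np.tanh(X @ params["W1"] + params["b1"])
    return H, H @ params["W2"] + params["b2"]

def train(params, X, Y, lr=0.01, epochs=500):
    """Plain batch gradient descent on squared error between predicted and
    aligned state durations (an assumed training criterion)."""
    n = X.shape[0]
    for _ in range(epochs):
        H, pred = forward(params, X)
        err = (pred - Y) / n                      # dLoss/dpred, averaged over samples
        dH = (err @ params["W2"].T) * (1.0 - H ** 2)  # backprop through tanh
        params["W2"] -= lr * (H.T @ err)
        params["b2"] -= lr * err.sum(axis=0)
        params["W1"] -= lr * (X.T @ dH)
        params["b1"] -= lr * dH.sum(axis=0)
    return params
```

In this sketch, each row of `X` would hold the encoded contextual features of one phone and the matching row of `Y` the output of `state_durations` for that phone. The outputs are real-valued, so a synthesis system would presumably need to round them to positive frame counts; that step is also an assumption and is not shown.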
Index Terms: duration modelling, HMM-based TTS, hidden Markov model, multilayer perceptron
Bibliographic reference. Ogbureke, Kalu U. / Cabral, João P. / Carson-Berndsen, Julie (2012): "Explicit duration modelling in HMM-based speech synthesis using a hybrid hidden Markov model-multilayer perceptron", In SAPA-SCALE-2012, 58-63.