SAPA-SCALE Conference 2012

Portland, OR, USA
September 7-8, 2012

Explicit Duration Modelling in HMM-based Speech Synthesis using a Hybrid Hidden Markov Model-Multilayer Perceptron

Kalu U. Ogbureke, João P. Cabral, Julie Carson-Berndsen

CNGL, School of Computer Science and Informatics, University College Dublin, Ireland

In HMM-based speech synthesis, accurate duration modelling is important because duration strongly affects the perceptual quality of speech, in particular its rhythm. For this reason, the hidden semi-Markov model (HSMM) is commonly used to model duration explicitly, rather than relying on the implicit state duration model that an HMM defines through its transition probabilities. The cost of this improvement is increased computational complexity in the parameter re-estimation algorithms and in the clustering of durations using contextual features. This paper proposes an alternative explicit duration modelling approach to HSMM: a hybrid of an HMM and a multilayer perceptron (MLP). In the first stage, the HMM is used for state-level phone alignment to obtain the HMM state durations for each phone. In the second stage, duration is modelled using an MLP whose inputs are contextual features and whose output units are the state durations. Both objective and perceptual evaluations showed that the proposed method improved duration prediction and the perceptual quality of synthetic speech compared with HSMM.
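The two-stage idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the five states per phone, the six-dimensional feature vectors, the synthetic targets, the network size, and the training settings are all assumptions chosen for illustration, and the helper names (`state_durations`, `train_mlp`, `predict`) are hypothetical. Stage one is reduced to counting frames per state in an alignment that is assumed to be already available; stage two fits a small MLP that maps contextual features to per-state durations.

```python
import numpy as np

N_STATES = 5  # assumed number of states per phone (not given in the abstract)

def state_durations(frame_states, n_states=N_STATES):
    """Stage 1 result: number of frames spent in each HMM state of one aligned phone."""
    return np.bincount(frame_states, minlength=n_states)

def train_mlp(X, Y, hidden=16, lr=0.05, epochs=500, seed=0):
    """Stage 2: one-hidden-layer MLP (tanh hidden units, linear outputs),
    trained by batch gradient descent on the mean squared error."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0.0, 0.1, (X.shape[1], hidden))
    b1 = np.zeros(hidden)
    W2 = rng.normal(0.0, 0.1, (hidden, Y.shape[1]))
    b2 = np.zeros(Y.shape[1])
    losses = []
    for _ in range(epochs):
        H = np.tanh(X @ W1 + b1)              # hidden activations
        err = (H @ W2 + b2) - Y               # prediction error per output unit
        losses.append(float(np.mean(err ** 2)))
        g_out = 2.0 * err / err.size          # gradient of the MSE w.r.t. the outputs
        g_hid = (g_out @ W2.T) * (1.0 - H ** 2)  # backpropagate through tanh
        W2 -= lr * (H.T @ g_out)
        b2 -= lr * g_out.sum(axis=0)
        W1 -= lr * (X.T @ g_hid)
        b1 -= lr * g_hid.sum(axis=0)
    return (W1, b1, W2, b2), losses

def predict(params, X):
    """Predict one real-valued duration per state for each feature vector."""
    W1, b1, W2, b2 = params
    return np.tanh(X @ W1 + b1) @ W2 + b2

# Stage 1 (illustrative): a state-level alignment of one phone, one state index per frame.
alignment = np.array([0, 0, 1, 1, 1, 2, 3, 3, 4])
durations = state_durations(alignment)  # 2, 3, 1, 2 and 1 frames for states 0..4

# Stage 2 (illustrative): synthetic contextual features and synthetic duration targets.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 6))                              # 6 hypothetical features
Y = 3.0 + np.abs(X @ rng.normal(size=(6, N_STATES)))       # made-up durations in frames
params, losses = train_mlp(X, Y)
predicted = predict(params, X[:3])                         # shape (3, N_STATES)
```

In a real system the targets would come from applying something like `state_durations` to every aligned phone in the training data, and the real-valued predictions would still need to be converted to whole frame counts before synthesis; how that is done is not described in the abstract.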

Index Terms: duration modelling, HMM-based TTS, hidden Markov model, multilayer perceptron


Bibliographic reference. Ogbureke, Kalu U. / Cabral, João P. / Carson-Berndsen, Julie (2012): "Explicit duration modelling in HMM-based speech synthesis using a hybrid hidden Markov model-multilayer perceptron", In SAPA-SCALE-2012, 58-63.