The accurate modelling of fundamental frequency, or F0, in HMM-based speech synthesis is a critical factor in achieving high-quality speech. However, it is also difficult because F0 values are normally considered to depend on a binary voicing decision, such that they are continuous in voiced regions and undefined in unvoiced regions. A widely used solution is the multi-space probability distribution HMM (MSDHMM), which directly models discontinuous F0 observations. An alternative solution, continuous F0 modelling, has recently been proposed and shown to be more effective in achieving natural synthesised speech. Here, continuous F0 observations are assumed to always exist, and hence they can be modelled by standard HMMs.
This paper describes a general mathematical framework for discontinuous F0 modelling, of which MSDHMM is a special case, and compares it to continuous F0 modelling. Several aspects of continuous F0 modelling are discussed in theory and examined in experiments: the use of a single F0 stream, globally tied distributions (GTD), and the assumption of a continuous unvoiced F0. Both objective measures and subjective listening tests demonstrate that the introduction of continuous unvoiced F0 is vital for achieving speech quality improvement.
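As a rough illustration of the contrast drawn above, the following sketch scores one frame under a two-space (voiced/unvoiced) distribution in the spirit of MSD, and then builds an always-defined F0 track with a separate voicing label. It is a sketch under stated assumptions, not the paper's method: the function names, the example values, and in particular the use of linear interpolation to fill unvoiced frames are all hypothetical choices made here, since the abstract does not say how unvoiced F0 values are obtained.

```python
import math

def gaussian_logpdf(x, mean, var):
    """Log density of a 1-D Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def msd_frame_loglik(f0, voiced_weight, mean, var):
    """Two-space frame score (illustrative): an unvoiced frame (f0 is None)
    contributes only the unvoiced space weight; a voiced frame contributes
    the voiced weight times a Gaussian density over the F0 value."""
    if f0 is None:
        return math.log(1.0 - voiced_weight)
    return math.log(voiced_weight) + gaussian_logpdf(f0, mean, var)

def fill_unvoiced(track):
    """Assumed filling scheme: linear interpolation between neighbouring
    voiced frames; leading/trailing gaps copy the nearest voiced value."""
    voiced = [i for i, v in enumerate(track) if v is not None]
    if not voiced:
        raise ValueError("track has no voiced frames")
    out = list(track)
    for i, value in enumerate(track):
        if value is not None:
            continue
        left = max((j for j in voiced if j < i), default=None)
        right = min((j for j in voiced if j > i), default=None)
        if left is None:
            out[i] = track[right]
        elif right is None:
            out[i] = track[left]
        else:
            w = (i - left) / (right - left)
            out[i] = (1.0 - w) * track[left] + w * track[right]
    return out

# Discontinuous view: F0 is undefined (None) in unvoiced frames.
track = [None, 120.0, 124.0, None, None, 130.0, None]

# Continuous view: every frame has an F0 value, plus a voicing label.
continuous = fill_unvoiced(track)
voicing = [v is not None for v in track]
```

In the discontinuous view the model must handle observations of varying dimensionality, whereas in the continuous view every frame carries a real-valued F0 that a standard HMM output distribution can score, with voicing kept as separate information. Real systems typically model log F0 rather than raw values in Hz; Hz is used here only to keep the numbers readable.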
Index Terms: F0 modelling, MSDHMM, globally tied distribution, HMM-based speech synthesis
Cite as: Yu, K., Thomson, B., Young, S. (2010) From discontinuous to continuous F0 modelling in HMM-based speech synthesis. Proc. 7th ISCA Workshop on Speech Synthesis (SSW 7), 94-99
@inproceedings{yu10_ssw,
  author={Kai Yu and Blaise Thomson and Steve Young},
  title={{From discontinuous to continuous F0 modelling in HMM-based speech synthesis}},
  year=2010,
  booktitle={Proc. 7th ISCA Workshop on Speech Synthesis (SSW 7)},
  pages={94--99}
}