The HMM-based text-to-speech system can produce high-quality synthetic speech with flexible modeling of spectral and prosodic parameters. However, the quality of the synthetic speech degrades when the feature vectors used in training are noisy. Among noisy features, pitch tracking errors and the resulting flawed voiced/unvoiced (VU) decisions are the two key factors behind voice quality problems. In this paper, an F0 generation process model is used to re-estimate F0 values in regions with pitch tracking errors, as well as in unvoiced regions. Prior knowledge of VU is assigned to each Mandarin phoneme and used for the VU decision. F0 can then be modeled within the standard HMM framework.
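As an illustration of how a generation process model can supply F0 values where the pitch tracker gives none, the following is a minimal sketch. It assumes the Fujisaki-type command-response formulation, in which log F0 is a baseline plus phrase and accent (tone) components; the abstract itself does not state the equations. The constants `ALPHA`, `BETA`, and `GAMMA` are commonly quoted defaults, not values from the paper, and the helper names (`log_f0`, `fill_f0`) and the command values in the usage note are hypothetical. Estimating the commands from observed F0, which is the harder part, is not shown.

```python
import math

# Assumed constants (commonly used defaults, not taken from the paper).
ALPHA = 3.0   # natural angular frequency of the phrase control mechanism (1/s)
BETA = 20.0   # natural angular frequency of the accent/tone control mechanism (1/s)
GAMMA = 0.9   # relative ceiling of the accent/tone component


def phrase_response(t, alpha=ALPHA):
    """Impulse response of the phrase control mechanism."""
    return alpha * alpha * t * math.exp(-alpha * t) if t >= 0.0 else 0.0


def accent_response(t, beta=BETA, gamma=GAMMA):
    """Step response of the accent/tone control mechanism, clipped at gamma."""
    if t < 0.0:
        return 0.0
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), gamma)


def log_f0(t, fb, phrases, accents):
    """Log F0 at time t (seconds).

    fb      -- baseline frequency in Hz
    phrases -- list of (onset_time, amplitude) impulse commands
    accents -- list of (start_time, end_time, amplitude) pedestal commands;
               amplitudes may be negative, as is often assumed for tone commands
    """
    value = math.log(fb)
    for onset, amp in phrases:
        value += amp * phrase_response(t - onset)
    for start, end, amp in accents:
        value += amp * (accent_response(t - start) - accent_response(t - end))
    return value


def fill_f0(times, observed, fb, phrases, accents):
    """Keep observed F0 where present; use the model where the frame is None.

    None marks frames that are unvoiced or flagged as pitch tracking errors
    (how frames get flagged is outside this sketch).
    """
    return [
        f0 if f0 is not None else math.exp(log_f0(t, fb, phrases, accents))
        for t, f0 in zip(times, observed)
    ]
```

For example, `fill_f0([0.0, 0.1, 0.2], [120.0, None, 125.0], 100.0, [(0.0, 0.5)], [(0.05, 0.25, 0.4)])` keeps the first and third frames and replaces the missing middle frame with the model value, giving a continuous contour that a standard (continuous-density) HMM stream could be trained on.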
Bibliographic reference. Wang, Miaomiao / Wen, Miaomiao / Hirose, Keikichi / Minematsu, Nobuaki (2010): "Improved generation of fundamental frequency in HMM-based speech synthesis using generation process model", In INTERSPEECH-2010, 2166-2169.