10th Annual Conference of the International Speech Communication Association

Brighton, United Kingdom
September 6-10, 2009

A Minimum V/U Error Approach to F0 Generation in HMM-Based TTS

Yao Qian, Frank K. Soong, Miaomiao Wang, Zhizheng Wu

Microsoft Research Asia, China

HMM-based TTS can produce highly intelligible speech of decent quality. However, the HMM model degrades when the feature vectors used in training are noisy. Among all noisy features, pitch tracking errors and the corresponding flawed voiced/unvoiced (v/u) decisions are identified as two key factors in voice quality problems. In this paper, we propose a minimum v/u error approach to F0 generation. Prior knowledge of v/u is imposed on each Mandarin phone, and accumulated v/u posterior probabilities are used to search for the optimal v/u switching point in each VU or UV segment during generation. Objectively, the new approach is shown to improve v/u prediction performance, particularly for voiced-to-unvoiced swapping errors, which are reduced from 3.7% (baseline) to 2.0% (new approach). The improvement is also confirmed subjectively by an AB preference test: 72% (new approach) versus 22% (baseline).
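
The switching-point search mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's exact formulation, so everything here is an assumption made for illustration: the function name `best_switch_point`, the per-frame voiced posteriors as input, and the cost, taken to be the expected number of v/u errors in a segment whose label order (VU or UV) is known in advance.

```python
from itertools import accumulate


def best_switch_point(voiced_post, order="VU"):
    """Find the frame index k that minimizes the expected v/u error.

    Hypothetical helper, not taken from the paper. Frames [0, k) receive
    the first label of `order` and frames [k, T) receive the second.
    `voiced_post[t]` is the posterior probability that frame t is voiced.
    Returns (k, expected_error). k may be 0 or T, meaning the whole
    segment ends up with a single label.
    """
    if order not in ("VU", "UV"):
        raise ValueError("order must be 'VU' or 'UV'")

    # Probability that each frame carries the FIRST label of the segment.
    first = [p if order == "VU" else 1.0 - p for p in voiced_post]
    n = len(first)

    # miss_first[k]: expected errors from giving frames [0, k) the first label.
    miss_first = [0.0] + list(accumulate(1.0 - p for p in first))
    # cum_first[k]: accumulated first-label probability over frames [0, k).
    cum_first = [0.0] + list(accumulate(first))
    total_first = cum_first[-1]

    # Expected errors from giving frames [k, T) the second label equal the
    # first-label probability mass left in that range.
    costs = [miss_first[k] + (total_first - cum_first[k]) for k in range(n + 1)]

    k = min(range(n + 1), key=costs.__getitem__)
    return k, costs[k]


if __name__ == "__main__":
    k, cost = best_switch_point([0.9, 0.8, 0.7, 0.2, 0.1], order="VU")
    print(k, round(cost, 3))
```

With the posteriors in the example, the first three frames are most likely voiced and the last two unvoiced, so the search places the switch at frame index 3. The accumulated sums let every candidate switching point be scored in a single pass over the segment.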

Bibliographic reference.  Qian, Yao / Soong, Frank K. / Wang, Miaomiao / Wu, Zhizheng (2009): "A minimum v/u error approach to F0 generation in HMM-based TTS", In INTERSPEECH-2009, 408-411.