Hidden Markov model-based speech synthesis is prone to over-smoothing of spectral parameter trajectories. Maximum-likelihood parameter generation favors smooth trajectories, so the utterance-level variance of each parameter trajectory is significantly reduced compared to the original recordings, resulting in muffled speech. To retain the natural variance, statistical global variance modeling has been used in parameter generation. Global variance modeling increases the utterance-level variance in synthesis, but it is computationally demanding: there is no closed-form solution, so an iterative approach is required. In this paper, we analyze the performance of two simple alternative approaches for retaining the natural variance of spectral parameters in synthesis, namely variance scaling and histogram equalization. Both methods apply analytically solvable parameter generation and impose the natural variance afterwards as an efficient post-processing step. Subjective evaluations carried out on English data confirm that the achieved synthesis quality is higher than with simple post-filtering and similar to that of standard global variance modeling.
Index Terms: statistical speech synthesis, global variance, variance scaling, histogram equalization
Bibliographic reference. Silén, Hanna / Helander, Elina / Nurminen, Jani / Gabbouj, Moncef (2012): "Ways to implement global variance in statistical speech synthesis", In INTERSPEECH-2012, 1436-1439.
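The two post-processing methods named in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, interfaces, and the rank-matching formulation of histogram equalization are assumptions made for clarity. Variance scaling rescales each generated trajectory around its mean so its utterance-level variance matches a target (natural) global variance; histogram equalization instead maps each dimension onto the empirical distribution of natural reference data.

```python
import numpy as np

def variance_scale(traj, target_var):
    """Scale a generated parameter trajectory so its utterance-level
    variance matches a target (natural) global variance.

    traj: (T, D) array of generated spectral parameters.
    target_var: (D,) per-dimension natural variance.
    Illustrative sketch; names and interface are not from the paper.
    """
    mean = traj.mean(axis=0)
    gen_var = traj.var(axis=0)
    # Per-dimension scale factor; guard against near-zero variance.
    scale = np.sqrt(target_var / np.maximum(gen_var, 1e-12))
    return mean + (traj - mean) * scale

def histogram_equalize(traj, ref):
    """Map each dimension of the generated trajectory onto the empirical
    distribution of reference (natural) data via rank matching.

    traj: (T, D) generated parameters; ref: (N, D) natural parameters.
    Illustrative sketch of the general technique.
    """
    out = np.empty_like(traj, dtype=float)
    for d in range(traj.shape[1]):
        # Rank of each frame within its own dimension, mapped to (0, 1).
        ranks = np.argsort(np.argsort(traj[:, d]))
        quantiles = (ranks + 0.5) / traj.shape[0]
        # Replace each value by the corresponding quantile of the
        # natural data, preserving the trajectory's rank order.
        out[:, d] = np.quantile(ref[:, d], quantiles)
    return out
```

Both operate per utterance after analytic parameter generation, which is why they avoid the iterative optimization that standard global variance modeling requires.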