The Continuous Wavelet Transform (CWT) has recently been proposed to model F0 in the context of speech synthesis. It was shown that systems that decompose the signal with the CWT tend to outperform systems that model the signal directly. The F0 signal is typically decomposed into several scales, each capturing a different frequency range. In these experiments, we reconstruct F0 from selected frequency bands and ask native listeners to judge the naturalness of synthesized utterances with respect to natural speech. Results indicate that HMM-generated F0 is comparable to the CWT low frequencies, suggesting that it mostly generates utterances with neutral intonation. Middle frequencies achieve very high levels of naturalness, while very high frequencies are mostly noise.
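As an illustration of decomposing F0 into scales and reconstructing it from selected frequency bands, the following is a minimal NumPy sketch and not the implementation used in the experiments. The Mexican hat mother wavelet, the eight dyadic scales, the grouping into high, middle and low bands, the 1/sqrt(scale) reconstruction weights, and the synthetic contour are all assumptions made for the example; the reconstruction recovers the contour shape only approximately and up to a constant factor.

```python
import numpy as np

def mexican_hat(length, scale):
    """Mexican hat (Ricker) wavelet sampled at `length` points."""
    t = np.arange(length) - (length - 1) / 2.0
    x = t / scale
    norm = 2.0 / (np.sqrt(3.0 * scale) * np.pi ** 0.25)
    return norm * (1.0 - x ** 2) * np.exp(-0.5 * x ** 2)

def cwt(signal, scales):
    """Return an array of shape (len(scales), len(signal)), one row per scale."""
    n = len(signal)
    out = np.empty((len(scales), n))
    for i, s in enumerate(scales):
        # Odd kernel length, capped so that mode="same" returns n samples.
        length = min(int(10 * s), n)
        if length % 2 == 0:
            length -= 1
        out[i] = np.convolve(signal, mexican_hat(length, s), mode="same")
    return out

def reconstruct(coeffs, scales, keep):
    """Approximate reconstruction from the rows selected by `keep`.

    The 1/sqrt(scale) weighting is an assumption; the result matches the
    original contour only up to a constant factor.
    """
    weights = 1.0 / np.sqrt(scales[keep])
    return (coeffs[keep] * weights[:, None]).sum(axis=0)

# Hypothetical, already interpolated and normalised log-F0 contour:
# a slow phrase-level decline, accent-like bumps, and a little noise.
rng = np.random.default_rng(0)
frames = np.arange(400)
f0 = -0.5 * frames / 400 + 0.3 * np.sin(2 * np.pi * frames / 60)
f0 = f0 + 0.02 * rng.standard_normal(400)
f0 = (f0 - f0.mean()) / f0.std()

scales = 2.0 ** np.arange(1, 9)  # 2, 4, ..., 256 frames; small scale = high frequency
coeffs = cwt(f0, scales)

high = reconstruct(coeffs, scales, slice(0, 3))  # fastest variation
mid = reconstruct(coeffs, scales, slice(3, 6))   # accent-sized movements
low = reconstruct(coeffs, scales, slice(6, 8))   # slow phrase-level trend
full = reconstruct(coeffs, scales, slice(None))  # all scales together
```

Because the reconstruction is a weighted sum over scales, the three band contours add up to the contour obtained from all scales, so each band can be synthesized and judged in isolation.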
Bibliographic reference. Ribeiro, Manuel Sam / Yamagishi, Junichi / Clark, Robert A. J. (2015): "A perceptual investigation of wavelet-based decomposition of F0 for text-to-speech synthesis", In INTERSPEECH-2015, 1586-1590.