Studies show that professional singing closely matches the associated melody and typically exhibits spectra that differ from those of speech in resonance tuning and the singing formant. Therefore, one important topic in speech-to-singing conversion is characterizing the spectral transformation between speech and singing. This paper extends two types of spectral transformation techniques, namely voice conversion and model adaptation, and examines their performance. For the first time, we carry out a comparative study over four singing voice synthesis techniques. Experiments on various data sizes reveal that the maximum-likelihood Gaussian mixture model (ML-GMM) approach to voice conversion always delivers the best spectral estimation accuracy, while model adaptation generates the best singing quality in all cases. When a large dataset is available, both techniques achieve the highest similarity to the target singing; with a small dataset, the highest similarity is obtained by ML-GMM. It is also found that music context-dependent modeling in adaptation, which involves a finer partition of the transform space, leads to pleasant singing spectra.
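To illustrate the GMM-based conversion family the abstract refers to, here is a minimal sketch of joint-density GMM spectral mapping: fit a GMM on concatenated source/target spectral frames, then convert a source frame via the per-component conditional expectation. This is a generic simplification for illustration (the function names and the two-component toy setup are assumptions), not the paper's exact ML-GMM trajectory method, which additionally uses dynamic features and maximum-likelihood parameter generation.

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

def fit_joint_gmm(X_src, Y_tgt, n_components=2, seed=0):
    """Fit a GMM on joint [source; target] spectral feature frames."""
    Z = np.hstack([X_src, Y_tgt])               # (n_frames, 2d)
    return GaussianMixture(n_components=n_components,
                           covariance_type="full",
                           random_state=seed).fit(Z)

def convert_frame(gmm, x):
    """Map one source frame x (dim d) to E[y | x] under the joint GMM,
    i.e. the minimum mean-square-error conversion function."""
    d = x.shape[0]
    mu, S, w = gmm.means_, gmm.covariances_, gmm.weights_
    # posterior p(k | x), using each component's marginal over the source block
    px = np.array([w[k] * multivariate_normal.pdf(x, mu[k, :d], S[k, :d, :d])
                   for k in range(len(w))])
    r = px / px.sum()
    y = np.zeros(d)
    for k in range(len(w)):
        # per-component linear regression: mu_y + S_yx S_xx^{-1} (x - mu_x)
        A = S[k, d:, :d] @ np.linalg.inv(S[k, :d, :d])
        y += r[k] * (mu[k, d:] + A @ (x - mu[k, :d]))
    return y

# Toy demonstration on synthetic frames where target ≈ 2*source + 1
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 1))
Y = 2.0 * X + 1.0 + 0.05 * rng.normal(size=(500, 1))
gmm = fit_joint_gmm(X, Y)
converted = convert_frame(gmm, np.array([0.5]))
```

In practice the frames would be spectral envelope features (e.g. mel-cepstra) aligned between parallel speech and singing recordings, and the conversion is applied frame by frame.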
Bibliographic reference. Lee, S. W., Wu, Zhizheng, Dong, Minghui, Tian, Xiaohai, Li, Haizhou (2014): "A comparative study of spectral transformation techniques for singing voice synthesis", in Proc. INTERSPEECH 2014, 2499-2503.