Voice conversion systems transform a speech signal so that it sounds as if it were uttered by another speaker. The conversion of spectral features has attracted a lot of research attention, but the conversion of pitch, which models speaker-dependent prosody, is often achieved by just controlling the F0 level and range. However, detailed prosody, which spans different linguistic units at several distinct temporal scales, can carry a significant amount of information related to speaker identity. This paper introduces a new method for prosody conversion that uses wavelets to decompose the pitch contour into ten temporal scales ranging from microprosody to the utterance level, which allows the different timings of prosodic phenomena to be modeled. The prosody conversion is carried out in the wavelet domain, using regression techniques originally developed for the spectral conversion of speech. The performance of the proposed method is evaluated within a real voice conversion system. The results for cross-gender conversion indicate a significant improvement in naturalness compared to the traditional approach of shifting and scaling the F0 to match the target speaker's mean and variance.
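For orientation, the traditional baseline mentioned in the abstract (shifting and scaling the F0 to match the target speaker's mean and variance) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the transform is applied to log-F0 (a common choice that the abstract does not specify), and unvoiced frames are assumed to be marked with an F0 of 0.

```python
import numpy as np


def log_f0_stats(f0):
    """Mean and standard deviation of log-F0 over voiced frames only."""
    f0 = np.asarray(f0, dtype=float)
    log_f0 = np.log(f0[f0 > 0])  # assumption: F0 == 0 marks an unvoiced frame
    return log_f0.mean(), log_f0.std()


def convert_f0_mean_variance(f0_source, src_mean, src_std, tgt_mean, tgt_std):
    """Shift and scale voiced log-F0 values to match the target statistics.

    Assumption: the statistics are computed in the log-F0 domain, for
    example with log_f0_stats() on training data from each speaker.
    """
    f0_source = np.asarray(f0_source, dtype=float)
    converted = np.zeros_like(f0_source)  # unvoiced frames stay at 0
    voiced = f0_source > 0
    log_f0 = np.log(f0_source[voiced])
    # Normalize with the source statistics, then rescale to the target.
    normalized = (log_f0 - src_mean) / src_std
    converted[voiced] = np.exp(normalized * tgt_std + tgt_mean)
    return converted
```

Because this transform applies one global shift and scale to the whole contour, it cannot change the shape of the contour at any individual temporal scale, which is the limitation the wavelet-domain method is meant to address.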
Bibliographic reference. Sanchez, Gerard / Silen, Hanna / Nurminen, Jani / Gabbouj, Moncef (2014): "Hierarchical modeling of F0 contours for voice conversion", In INTERSPEECH-2014, 2318-2321.