EUROSPEECH 2003 - INTERSPEECH 2003
A voice conversion method is applied to synthesizing emotional speech from standard read (neutral) speech. Pairs of neutral and emotional utterances are used to train the conversion rules. The conversion uses a GMM (Gaussian Mixture Model) combined with DFW (Dynamic Frequency Warping). We also adopt STRAIGHT, a high-quality speech analysis-synthesis algorithm. The target emotions are (hot) anger, (cold) sadness and (hot) happiness. The converted speech is first evaluated objectively, using mel cepstrum distortion as the criterion. The result confirms that GMM-based voice conversion can reduce the distortion between the target speech and the neutral speech. A subjective test is also carried out to investigate the perceptual effect. To examine the influence of prosody, two kinds of prosody are used for synthesis: natural prosody extracted from neutral speech, and natural prosody extracted from emotional speech. The result shows that prosody is the main contributor to the perceived emotion and that spectrum conversion can reinforce it.
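The abstract names mel cepstrum distortion as the objective criterion but does not give its formula. As a rough illustration, the sketch below uses a commonly cited definition, 10/ln(10) * sqrt(2 * sum of squared coefficient differences), averaged over frames. The function name `mel_cepstral_distortion`, the exclusion of the 0th (energy) coefficient, and the assumption that the two sequences are already time-aligned are all illustrative choices, not details taken from the paper.

```python
import math


def mel_cepstral_distortion(target, converted):
    """Mean mel-cepstral distortion (dB) between two time-aligned sequences.

    Each argument is a list of frames; each frame is a list of mel-cepstral
    coefficients. Both the formula and the frame layout are assumptions made
    for illustration, not the paper's stated procedure.
    """
    if not target or len(target) != len(converted):
        raise ValueError("sequences must be non-empty and of equal length")

    scale = 10.0 / math.log(10.0)
    total = 0.0
    for t_frame, c_frame in zip(target, converted):
        if len(t_frame) != len(c_frame):
            raise ValueError("frames must have the same number of coefficients")
        # Skip coefficient 0 (energy term): a common convention, assumed here.
        squared_error = sum((t - c) ** 2 for t, c in zip(t_frame[1:], c_frame[1:]))
        total += scale * math.sqrt(2.0 * squared_error)
    return total / len(target)


if __name__ == "__main__":
    # Toy frames, invented purely to show the call shape.
    neutral = [[1.0, 0.20, -0.10], [1.1, 0.25, -0.05]]
    emotional = [[1.3, 0.50, -0.30], [1.2, 0.45, -0.20]]
    print(mel_cepstral_distortion(emotional, neutral))
```

Under this definition, a lower value for converted speech than for unconverted neutral speech, both measured against the emotional target, would correspond to the reduction in distortion the abstract reports.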
Bibliographic reference. Kawanami, Hiromichi / Iwami, Yohei / Toda, Tomoki / Saruwatari, Hiroshi / Shikano, Kiyohiro (2003): "GMM-based voice conversion applied to emotional speech synthesis", In EUROSPEECH-2003, 2401-2404.