8th European Conference on Speech Communication and Technology

Geneva, Switzerland
September 1-4, 2003


GMM-Based Voice Conversion Applied to Emotional Speech Synthesis

Hiromichi Kawanami (1), Yohei Iwami (1), Tomoki Toda (2), Hiroshi Saruwatari (1), Kiyohiro Shikano (1)

(1) Nara Institute of Science and Technology, Japan
(2) ATR-SLT, Japan

A voice conversion method is applied to synthesizing emotional speech from standard read (neutral) speech. Pairs of neutral and emotional speech are used to train the conversion rules. The conversion uses a GMM (Gaussian Mixture Model) with DFW (Dynamic Frequency Warping). We also adopt STRAIGHT, a high-quality speech analysis-synthesis algorithm. The target emotions are (hot) anger, (cold) sadness, and (hot) happiness. The converted speech is first evaluated objectively, using mel-cepstrum distortion as the criterion. The result confirms that GMM-based voice conversion can reduce the distortion between the target speech and the neutral speech. A subjective test is also carried out to investigate the perceptual effect. To examine the influence of prosody, two kinds of prosody are used for synthesis: natural prosody extracted from neutral speech, and natural prosody extracted from emotional speech. The result shows that prosody is the main contributor to the perceived emotion and that spectrum conversion can reinforce it.
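
As an illustration of the objective criterion, the following is a minimal sketch of a mel-cepstrum distortion measure. The abstract does not give the formula, so the sketch assumes the commonly used definition (a scaled Euclidean distance over mel-cepstral coefficients, excluding the 0th coefficient, averaged over frames) and assumes the two sequences are already time-aligned. The function name and the toy coefficient values are hypothetical and are not data from the paper.

```python
import math


def mel_cepstral_distortion(target, converted):
    """Mean mel-cepstral distortion in dB between two aligned sequences.

    Each argument is a list of frames; each frame is a list of mel-cepstral
    coefficients. The first entry of every frame (the 0th coefficient) is
    skipped, which is an assumption of this sketch.
    """
    if not target or len(target) != len(converted):
        raise ValueError("sequences must be non-empty and time-aligned")
    scale = 10.0 / math.log(10.0)
    total = 0.0
    for t_frame, c_frame in zip(target, converted):
        if len(t_frame) != len(c_frame):
            raise ValueError("frames must have the same number of coefficients")
        # Squared distance over coefficients 1..D (0th coefficient excluded).
        sq = sum((t - c) ** 2 for t, c in zip(t_frame[1:], c_frame[1:]))
        total += scale * math.sqrt(2.0 * sq)
    return total / len(target)


# Toy, made-up frames: [c0, c1, c2] per frame.
target_emotional = [[1.0, 0.5, -0.2], [0.9, 0.4, -0.1]]
neutral = [[1.0, 0.1, 0.2], [0.9, 0.0, 0.3]]
converted = [[1.0, 0.4, -0.1], [0.9, 0.3, 0.0]]

mcd_neutral = mel_cepstral_distortion(target_emotional, neutral)
mcd_converted = mel_cepstral_distortion(target_emotional, converted)
print(f"neutral vs. target:   {mcd_neutral:.3f} dB")
print(f"converted vs. target: {mcd_converted:.3f} dB")
```

Under this definition, a conversion is considered effective when the distortion between converted and target speech is lower than the distortion between neutral and target speech, as in the toy values above.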

Bibliographic reference.  Kawanami, Hiromichi / Iwami, Yohei / Toda, Tomoki / Saruwatari, Hiroshi / Shikano, Kiyohiro (2003): "GMM-based voice conversion applied to emotional speech synthesis", In EUROSPEECH-2003, 2401-2404.