This paper proposes a novel method to transplant emotions to a new speaker in statistical speech synthesis based on an emotion additive model (EAM), which represents the differences between emotional and neutral voices. This method trains EAM using neutral and emotional speech data of multiple speakers and applies it to a neutral voice model of a new speaker (target). There is some degradation in speech quality due to a mismatch in speakers between the EAM and the target neutral voice model. To alleviate the mismatch, we introduce an eigenvoice technique to this framework. We build neutral voice models and EAMs using multiple speakers, and construct an eigenvoice space consisting the neutral voice models and EAMs. To transplant the emotion to the target speaker, the proposed method estimates weights of eigenvoices for the target neutral speech data based on a maximum likelihood criteria. The EAM of the target speaker is obtained by applying the estimated weights to the EAM parameters of the eigenvoice space. Emotional speech is generated using the EAM and the neutral voice model. Experimental results show that the proposed method performs emotional speech synthesis with reasonable emotions and high speech quality.
Bibliographic reference. Ohtani, Yamato / Nasu, Yu / Morita, Masahiro / Akamine, Masami (2015): "Emotional transplant in statistical speech synthesis based on emotion additive model", In INTERSPEECH-2015, 274-278.