A Multi-Speaker Emotion Morphing Model Using Highway Networks and Maximum Likelihood Objective

Ravi Shankar, Jacob Sager, Archana Venkataraman


We introduce a new model for emotion conversion in speech based on highway neural networks. Our model uses the contextual pitch, energy and spectral information of a source emotional utterance to predict the framewise fundamental frequency and signal intensity under a target emotion. We also incorporate a latent gender representation to promote cross-speaker generalizability. Our neural network is trained to maximize the log-likelihood of the prediction error under an assumed Laplacian distribution. We validate our model on the VESUS repository collected at Johns Hopkins University, which contains parallel emotional utterances from 10 actors across 5 emotional classes. The proposed algorithm outperforms three state-of-the-art baselines in terms of the mean absolute error and correlation between the predicted and target values. We evaluate the quality of our emotion manipulations via crowd-sourcing. Finally, we apply our emotion morphing model to utterances generated by WaveNet to demonstrate our unique ability to inject emotion into synthetic speech.
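The maximum likelihood objective mentioned above can be illustrated with a minimal sketch. For a zero-mean Laplacian with scale b, the negative log-likelihood of an error e is log(2b) + |e|/b, so with a fixed scale, maximizing the likelihood amounts to minimizing the mean absolute error. The abstract does not say whether the scale is fixed or learned, so the fixed `scale` parameter, the function names, and the sample values below are all assumptions made for illustration, not the authors' implementation.

```python
import math


def laplace_nll(errors, scale=1.0):
    """Mean negative log-likelihood of errors under Laplace(0, scale).

    Assumption: the scale is a fixed constant; the paper may treat it differently.
    """
    if scale <= 0:
        raise ValueError("scale must be positive")
    mean_abs = sum(abs(e) for e in errors) / len(errors)
    # -log( 1/(2b) * exp(-|e|/b) ) averaged over frames
    return math.log(2.0 * scale) + mean_abs / scale


def mean_absolute_error(predicted, target):
    """Plain mean absolute error between two equal-length sequences."""
    return sum(abs(p - t) for p, t in zip(predicted, target)) / len(target)


# Hypothetical framewise F0 values in Hz, for illustration only.
predicted = [200.0, 210.0, 190.0]
target = [205.0, 200.0, 190.0]

errors = [p - t for p, t in zip(predicted, target)]
nll = laplace_nll(errors, scale=1.0)
mae = mean_absolute_error(predicted, target)

# With scale = 1, the NLL differs from the MAE only by the constant log(2).
print(nll - mae)
```

Because the two quantities differ only by a constant when the scale is fixed, a gradient-based optimizer would follow the same direction for either one.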


DOI: 10.21437/Interspeech.2019-2512

Cite as: Shankar, R., Sager, J., Venkataraman, A. (2019) A Multi-Speaker Emotion Morphing Model Using Highway Networks and Maximum Likelihood Objective. Proc. Interspeech 2019, 2848-2852, DOI: 10.21437/Interspeech.2019-2512.


@inproceedings{Shankar2019,
  author={Ravi Shankar and Jacob Sager and Archana Venkataraman},
  title={{A Multi-Speaker Emotion Morphing Model Using Highway Networks and Maximum Likelihood Objective}},
  year=2019,
  booktitle={Proc. Interspeech 2019},
  pages={2848--2852},
  doi={10.21437/Interspeech.2019-2512},
  url={http://dx.doi.org/10.21437/Interspeech.2019-2512}
}