Automated Emotion Morphing in Speech Based on Diffeomorphic Curve Registration and Highway Networks

Ravi Shankar, Hsi-Wei Hsieh, Nicolas Charon, Archana Venkataraman


We present a novel approach for emotion conversion that bridges the domains of speech analysis and computer vision. Our strategy is to warp the pitch contour of a source emotional utterance using diffeomorphic curve registration. The associated dynamical process pushes the original source contour towards that of a target emotional utterance. Mathematically, this warping process is completely specified by a set of initial momenta. Therefore, we use parallel data to train a highway neural network (HNet) to predict these initial momenta directly from the signal characteristics. The input features to the HNet include contextual pitch and spectral information. Once trained, the HNet is used to obtain the initial momenta for new utterances. From here, the diffeomorphic process takes over and warps the pitch contour accordingly. We validate our framework on the VESUS repository collected at Johns Hopkins University, which contains parallel emotional utterances from 10 actors. The proposed warping is more accurate than three state-of-the-art baselines for emotion conversion. We also evaluate the quality of our emotion manipulations via crowdsourcing.
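The highway network at the core of the regression stage gates each layer between a nonlinear transform and an identity carry of the input, y = T(x) · H(x) + (1 − T(x)) · x. The sketch below is a minimal, illustrative NumPy implementation of one such layer; the weights, feature dimension, and the `HighwayLayer` name are hypothetical and not taken from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class HighwayLayer:
    """One highway layer: y = T(x) * H(x) + (1 - T(x)) * x,
    where T is a sigmoid transform gate and H a nonlinear transform."""
    def __init__(self, dim, rng=None):
        rng = rng or np.random.default_rng(0)
        self.W_h = rng.standard_normal((dim, dim)) * 0.1
        self.b_h = np.zeros(dim)
        self.W_t = rng.standard_normal((dim, dim)) * 0.1
        # A negative gate bias initially favors carrying the input through,
        # the standard initialization choice for highway networks.
        self.b_t = np.full(dim, -1.0)

    def forward(self, x):
        h = np.tanh(x @ self.W_h + self.b_h)   # transform path H(x)
        t = sigmoid(x @ self.W_t + self.b_t)   # transform gate T(x)
        return t * h + (1.0 - t) * x           # gated mixture

# Hypothetical usage: map one frame of contextual pitch/spectral
# features to a momentum estimate of the same dimension.
features = np.ones(8)
layer = HighwayLayer(8)
momenta = layer.forward(features)
```

Because the gate interpolates between transform and identity, the output always has the same dimension as the input, which lets such layers be stacked deeply without vanishing gradients.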


DOI: 10.21437/Interspeech.2019-2386

Cite as: Shankar, R., Hsieh, H.-W., Charon, N., Venkataraman, A. (2019) Automated Emotion Morphing in Speech Based on Diffeomorphic Curve Registration and Highway Networks. Proc. Interspeech 2019, 4499-4503, DOI: 10.21437/Interspeech.2019-2386.


@inproceedings{Shankar2019,
  author={Ravi Shankar and Hsi-Wei Hsieh and Nicolas Charon and Archana Venkataraman},
  title={{Automated Emotion Morphing in Speech Based on Diffeomorphic Curve Registration and Highway Networks}},
  year=2019,
  booktitle={Proc. Interspeech 2019},
  pages={4499--4503},
  doi={10.21437/Interspeech.2019-2386},
  url={http://dx.doi.org/10.21437/Interspeech.2019-2386}
}