CycleGAN-Based Emotion Style Transfer as Data Augmentation for Speech Emotion Recognition

Fang Bao, Michael Neumann, Ngoc Thang Vu


Cycle-consistent adversarial networks (CycleGAN) have shown great success in image style transfer with unpaired datasets. Inspired by this, we investigate emotion style transfer to generate synthetic data, aiming to address the data scarcity problem in speech emotion recognition. Specifically, we propose a CycleGAN-based method to transfer feature vectors extracted from a large unlabeled speech corpus into synthetic features representing the given target emotions. We extend the CycleGAN framework with a classification loss that improves the discriminability of the generated data. To show the effectiveness of the proposed method, we present results for speech emotion recognition using the generated feature vectors as (i) an augmentation of the training data and (ii) a standalone training set. Our experimental results reveal that utilizing synthetic feature vectors improves classification performance in both within-corpus and cross-corpus evaluation.
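The abstract describes extending the standard CycleGAN objective (adversarial loss plus cycle-consistency loss) with a classification term. A minimal sketch of how such a combined generator objective could be composed — the loss weights, helper names, and the use of L1 for cycle consistency are illustrative assumptions, not details taken from the paper:

```python
import math

def l1_cycle_loss(x, x_reconstructed):
    """Cycle-consistency term: mean absolute error between an original
    feature vector and its cycle-reconstruction G_BA(G_AB(x))."""
    return sum(abs(a - b) for a, b in zip(x, x_reconstructed)) / len(x)

def classification_loss(class_probs, target_emotion_idx):
    """Cross-entropy of an auxiliary emotion classifier on the generated
    features; encourages the synthetic features to be discriminable."""
    return -math.log(class_probs[target_emotion_idx])

def generator_objective(adv_loss, cyc_loss, cls_loss,
                        lambda_cyc=10.0, lambda_cls=1.0):
    """Combined generator objective: adversarial loss plus weighted
    cycle-consistency and classification terms (weights are assumed)."""
    return adv_loss + lambda_cyc * cyc_loss + lambda_cls * cls_loss
```

In a full training loop, `adv_loss` would come from the discriminator's judgment of the generated features, and the three terms would be backpropagated jointly through the generator.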


DOI: 10.21437/Interspeech.2019-2293

Cite as: Bao, F., Neumann, M., Vu, N.T. (2019) CycleGAN-Based Emotion Style Transfer as Data Augmentation for Speech Emotion Recognition. Proc. Interspeech 2019, 2828-2832, DOI: 10.21437/Interspeech.2019-2293.


@inproceedings{Bao2019,
  author={Fang Bao and Michael Neumann and Ngoc Thang Vu},
  title={{CycleGAN-Based Emotion Style Transfer as Data Augmentation for Speech Emotion Recognition}},
  year=2019,
  booktitle={Proc. Interspeech 2019},
  pages={2828--2832},
  doi={10.21437/Interspeech.2019-2293},
  url={http://dx.doi.org/10.21437/Interspeech.2019-2293}
}