Autonomous Emotion Learning in Speech: A View of Zero-Shot Speech Emotion Recognition

Xinzhou Xu, Jun Deng, Nicholas Cummins, Zixing Zhang, Li Zhao, Björn W. Schuller


Conventionally, speech emotion recognition is achieved using passive learning approaches. In contrast to such approaches, we herein propose and develop a dynamic method of autonomous emotion learning based on zero-shot learning. The proposed methodology employs emotional dimensions as the attributes in the zero-shot learning paradigm, resulting in two phases of learning, namely attribute learning and label learning. Attribute learning connects the paralinguistic features and attributes utilising speech with known emotional labels, while label learning aims to define unseen emotions through the attributes. The experimental results achieved on the CINEMO corpus indicate that zero-shot learning is a useful technique for autonomous speech-based emotion learning, achieving accuracies considerably better than both chance level and an attribute-based gold-standard setup. Furthermore, different emotion recognition tasks, emotional attributes, and employed approaches strongly influence system performance.
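The two-phase paradigm described above can be illustrated with a minimal sketch: an attribute-learning stage regresses emotional dimensions from acoustic features of seen emotions, and a label-learning stage assigns unseen labels by matching predicted attributes to label signatures. All feature dimensions, emotion signatures, and the least-squares regressor below are illustrative assumptions, not the paper's actual CINEMO setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative attribute signatures (arousal, valence) in [-1, 1];
# these values are assumptions for the sketch, not from the paper.
SEEN = {"joy": np.array([0.8, 0.9]), "anger": np.array([0.9, -0.8])}
UNSEEN = {"sadness": np.array([-0.6, -0.7]), "calm": np.array([-0.7, 0.6])}

def make_data(signatures, n_per_class=50, dim=20, noise=0.1):
    """Synthesise paralinguistic-style features whose first two columns
    carry the attribute signature of each emotion (a toy stand-in for
    real acoustic feature extraction)."""
    X, A, y = [], [], []
    for label, sig in signatures.items():
        feats = rng.normal(scale=noise, size=(n_per_class, dim))
        feats[:, :2] += sig  # embed the attributes into the features
        X.append(feats)
        A.append(np.tile(sig, (n_per_class, 1)))
        y += [label] * n_per_class
    return np.vstack(X), np.vstack(A), y

# Phase 1 -- attribute learning: map features to emotional dimensions
# using speech with known (seen) emotion labels.
X_tr, A_tr, _ = make_data(SEEN)
W, *_ = np.linalg.lstsq(X_tr, A_tr, rcond=None)

# Phase 2 -- label learning: define unseen emotions via their attribute
# signatures and pick the nearest signature for each test utterance.
X_te, _, y_te = make_data(UNSEEN)
A_pred = X_te @ W
labels = list(UNSEEN)
sigs = np.stack([UNSEEN[l] for l in labels])
pred = [labels[np.argmin(np.linalg.norm(sigs - a, axis=1))] for a in A_pred]

accuracy = float(np.mean([p == t for p, t in zip(pred, y_te)]))
```

On this toy data the nearest-signature decision recovers the unseen labels well above chance, mirroring the role the attributes play in the zero-shot setup, although the paper's actual attribute and label learners are more sophisticated.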


DOI: 10.21437/Interspeech.2019-2406

Cite as: Xu, X., Deng, J., Cummins, N., Zhang, Z., Zhao, L., Schuller, B.W. (2019) Autonomous Emotion Learning in Speech: A View of Zero-Shot Speech Emotion Recognition. Proc. Interspeech 2019, 949-953, DOI: 10.21437/Interspeech.2019-2406.


@inproceedings{Xu2019,
  author={Xinzhou Xu and Jun Deng and Nicholas Cummins and Zixing Zhang and Li Zhao and Björn W. Schuller},
  title={{Autonomous Emotion Learning in Speech: A View of Zero-Shot Speech Emotion Recognition}},
  year={2019},
  booktitle={Proc. Interspeech 2019},
  pages={949--953},
  doi={10.21437/Interspeech.2019-2406},
  url={http://dx.doi.org/10.21437/Interspeech.2019-2406}
}