Bags in Bag: Generating Context-Aware Bags for Tracking Emotions from Speech

Jing Han, Zixing Zhang, Maximilian Schmitt, Zhao Ren, Fabien Ringeval, Björn Schuller

Whereas systems based on deep learning have been proposed to learn efficient representations of emotional speech data, methods such as Bag-of-Audio-Words (BoAW) have yielded similar or even better performance while providing understandable representations of the data. In those representations, however, context information is overlooked, as the BoAW include only local information. In this paper, we propose to learn a novel representation, ‘Bag-of-Context-Aware-Words’, that encapsulates the context of neighbouring frames of BoAW: segment-level BoAW are extracted in the first layer and are then utilised to create a final instance-level bag. Such a hierarchical structure of BoAW enables the system to learn representations with context information. To evaluate the effectiveness of the method, we perform extensive experiments on RECOLA, a time- and value-continuous spontaneous emotion database. The results show that the best segment length for valence is twice that for arousal, suggesting that predicting emotional valence requires more context information than predicting arousal. Moreover, the proposed Bag-of-Context-Aware-Words outperforms all previously reported results on RECOLA.
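
The two-layer idea described in the abstract can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the function names (`bag`, `windows`, `bags_in_bag`), the codebooks drawn by random sampling, the nearest-codeword assignment, and all window lengths and hop sizes are assumptions chosen only for illustration, and the input frames are synthetic random vectors rather than real acoustic features.

```python
import random

def nearest(vec, codebook):
    """Index of the codeword closest to vec (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((a - b) ** 2 for a, b in zip(vec, codebook[k])),
    )

def bag(vectors, codebook):
    """Normalised histogram of codeword assignments for a list of vectors."""
    counts = [0.0] * len(codebook)
    for v in vectors:
        counts[nearest(v, codebook)] += 1.0
    total = sum(counts)
    return [c / total for c in counts] if total else counts

def windows(items, length, hop):
    """Sliding windows of `length` items, advanced by `hop` items."""
    return [items[i:i + length] for i in range(0, len(items) - length + 1, hop)]

def bags_in_bag(frames, cb1, cb2, seg_len, seg_hop, ctx_len, ctx_hop):
    """Hypothetical two-layer bag: segment-level bags, then a bag over them."""
    # Layer 1: one bag per short segment of low-level frames.
    segment_bags = [bag(w, cb1) for w in windows(frames, seg_len, seg_hop)]
    # Layer 2: neighbouring segment-level bags are quantised against a
    # second codebook, so each output bag summarises a longer context.
    return [bag(w, cb2) for w in windows(segment_bags, ctx_len, ctx_hop)]

# Synthetic data: 200 frames of 4-dimensional features (assumption).
rng = random.Random(0)
frames = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(200)]

# Codebooks drawn by random sampling (assumption; clustering would also work).
cb1 = rng.sample(frames, 8)
segment_bags = [bag(w, cb1) for w in windows(frames, 10, 5)]
cb2 = rng.sample(segment_bags, 6)

context_bags = bags_in_bag(frames, cb1, cb2,
                           seg_len=10, seg_hop=5, ctx_len=4, ctx_hop=1)
print(len(context_bags), len(context_bags[0]))
```

In this sketch, `seg_len` plays the role of the segment length mentioned in the abstract: a larger value lets each first-layer bag cover more frames, which is the kind of setting the abstract reports as differing between valence and arousal.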

DOI: 10.21437/Interspeech.2018-996

Cite as: Han, J., Zhang, Z., Schmitt, M., Ren, Z., Ringeval, F., Schuller, B. (2018) Bags in Bag: Generating Context-Aware Bags for Tracking Emotions from Speech. Proc. Interspeech 2018, 3082-3086, DOI: 10.21437/Interspeech.2018-996.

  author={Jing Han and Zixing Zhang and Maximilian Schmitt and Zhao Ren and Fabien Ringeval and Björn Schuller},
  title={Bags in Bag: Generating Context-Aware Bags for Tracking Emotions from Speech},
  booktitle={Proc. Interspeech 2018},