Encoding Individual Acoustic Features Using Dyad-Augmented Deep Variational Representations for Dialog-level Emotion Recognition

Jeng-Lin Li, Chi-Chun Lee


Face-to-face dyadic spoken dialog is a fundamental unit of human interaction. Despite extensive empirical evidence demonstrating interlocutors' behavioral dependency in dyadic interactions, few technical works leverage these unique interaction dynamics to advance emotion recognition in face-to-face settings. In this work, we propose a framework for encoding an individual's acoustic features with dyad-augmented deep networks. The dyad-augmented deep networks include a general variational deep Gaussian Mixture embedding network and a dyad-specific fine-tuned network. Our framework uses the augmented dyad-specific feature space to capture the unique behavior patterns that emerge when two people interact. We perform dialog-level emotion regression tasks on both the CreativeIT and the NNIME databases. We obtain affect regression accuracies of 0.544 and 0.387 for activation and valence in the CreativeIT database (relative improvements of 4.41% and 4.03% over features without the augmented dyad-specific representation), and 0.700 and 0.604 (relative improvements of 4.48% and 4.14%) for regressing activation and valence in the NNIME database.
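
To give a concrete picture of the augmentation idea described in the abstract, the following is a minimal sketch and not the authors' implementation: it substitutes a plain Gaussian variational encoder for the paper's variational deep Gaussian Mixture embedding network, uses hypothetical feature and latent dimensions, and stands in a copied encoder for the dyad-specific fine-tuned network (the fine-tuning loop and the actual dialog-level regressor are omitted).

# Minimal sketch (not the authors' code): augment an individual's acoustic
# features with embeddings from a general variational encoder and a
# dyad-specific copy of it, then pool to a dialog-level representation for
# affect regression. All dimensions and layer sizes are illustrative.

import copy
import torch
import torch.nn as nn

class VariationalEncoder(nn.Module):
    """Maps frame-level acoustic features to a latent embedding."""
    def __init__(self, feat_dim=45, hidden_dim=128, latent_dim=32):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(feat_dim, hidden_dim), nn.ReLU())
        self.mu = nn.Linear(hidden_dim, latent_dim)       # posterior mean
        self.logvar = nn.Linear(hidden_dim, latent_dim)   # posterior log-variance

    def forward(self, x):
        h = self.backbone(x)
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterization trick: sample z ~ N(mu, sigma^2)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return z, mu, logvar

def dyad_augmented_features(x, general_enc, dyad_enc):
    """Concatenate raw features with general and dyad-specific embeddings."""
    with torch.no_grad():
        z_general, _, _ = general_enc(x)
        z_dyad, _, _ = dyad_enc(x)
    return torch.cat([x, z_general, z_dyad], dim=-1)

if __name__ == "__main__":
    torch.manual_seed(0)
    general_enc = VariationalEncoder()
    # Dyad-specific network: start from the general encoder, then fine-tune
    # on the target dyad's data (fine-tuning loop omitted in this sketch).
    dyad_enc = copy.deepcopy(general_enc)

    frames = torch.randn(200, 45)                  # dummy dialog: 200 frames x 45 features
    augmented = dyad_augmented_features(frames, general_enc, dyad_enc)
    dialog_repr = augmented.mean(dim=0)            # simple dialog-level pooling
    regressor = nn.Linear(dialog_repr.numel(), 2)  # predict (activation, valence)
    print(regressor(dialog_repr))                  # dialog-level affect estimate

The design choice this sketch highlights is that the dyad-specific encoder shares its architecture and initialization with the general one, so its embedding captures what is unique to the interacting pair rather than replacing the general representation.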


 DOI: 10.21437/Interspeech.2018-1455

Cite as: Li, J.-L., Lee, C.-C. (2018) Encoding Individual Acoustic Features Using Dyad-Augmented Deep Variational Representations for Dialog-level Emotion Recognition. Proc. Interspeech 2018, 3102-3106, DOI: 10.21437/Interspeech.2018-1455.


@inproceedings{Li2018,
  author={Jeng-Lin Li and Chi-Chun Lee},
  title={Encoding Individual Acoustic Features Using Dyad-Augmented Deep Variational Representations for Dialog-level Emotion Recognition},
  year=2018,
  booktitle={Proc. Interspeech 2018},
  pages={3102--3106},
  doi={10.21437/Interspeech.2018-1455},
  url={http://dx.doi.org/10.21437/Interspeech.2018-1455}
}