Speaker Dependency Analysis, Audiovisual Fusion Cues and a Multimodal BLSTM for Conversational Engagement Recognition

Yuyun Huang, Emer Gilmartin, Nick Campbell


Conversational engagement is a multimodal phenomenon and an essential cue for assessing both human-human and human-robot communication. Our engagement study addressed both speaker-dependent and speaker-independent scenarios, using handcrafted audio-visual features. Fixed window sizes for the feature-fusion method were analysed, and novel dynamic window size selection and multimodal bi-directional long short-term memory (multimodal BLSTM) approaches were proposed and evaluated for engagement-level recognition.
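As a rough illustration of the fixed-window feature-fusion step described above, the sketch below concatenates frame-aligned audio and visual features and pools them over a fixed window. The function name `fuse_window`, the feature dimensions, the 25-frame window, and the mean-pooling choice are all assumptions for illustration, not the paper's actual configuration:

```python
import numpy as np

def fuse_window(audio_feats, video_feats, window=25):
    """Fuse frame-aligned audio and visual features over fixed windows.

    audio_feats: (T, Da) array of per-frame audio features
    video_feats: (T, Dv) array of per-frame visual features
    Returns: (T // window, Da + Dv) array of window-level fused features.
    """
    # Early (feature-level) fusion: concatenate per frame.
    fused = np.concatenate([audio_feats, video_feats], axis=1)  # (T, Da + Dv)
    n = fused.shape[0] // window
    # Drop the trailing partial window, then mean-pool each window.
    return fused[: n * window].reshape(n, window, -1).mean(axis=1)
```

A dynamic variant, as proposed in the paper, would choose `window` per segment rather than fixing it in advance; the selection criterion is not specified in this abstract.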


DOI: 10.21437/Interspeech.2017-1496

Cite as: Huang, Y., Gilmartin, E., Campbell, N. (2017) Speaker Dependency Analysis, Audiovisual Fusion Cues and a Multimodal BLSTM for Conversational Engagement Recognition. Proc. Interspeech 2017, 3359-3363, DOI: 10.21437/Interspeech.2017-1496.


@inproceedings{Huang2017,
  author={Yuyun Huang and Emer Gilmartin and Nick Campbell},
  title={Speaker Dependency Analysis, Audiovisual Fusion Cues and a Multimodal BLSTM for Conversational Engagement Recognition},
  year=2017,
  booktitle={Proc. Interspeech 2017},
  pages={3359--3363},
  doi={10.21437/Interspeech.2017-1496},
  url={http://dx.doi.org/10.21437/Interspeech.2017-1496}
}