Conversational engagement is a multimodal phenomenon and an essential cue for assessing both human-human and human-robot communication. Our engagement study addressed both speaker-dependent and speaker-independent scenarios using handcrafted audio-visual features. We analysed fixed window sizes for the feature fusion method, and proposed and evaluated a novel dynamic window size selection approach and a multimodal bi-directional long short-term memory (Multimodal BLSTM) model for engagement level recognition.
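To make the fixed-window fusion idea concrete, the following is a minimal NumPy sketch of early audio-visual feature fusion over fixed-size windows. It is an illustrative assumption, not the authors' implementation: the function name `fuse_windows`, the feature dimensions, and averaging as the window-level summary are all hypothetical choices.

```python
import numpy as np

def fuse_windows(audio_feats, visual_feats, window_size):
    """Concatenate per-frame audio and visual features, then average
    them over fixed-size windows (one fused vector per window).
    Illustrative sketch only; not the paper's actual method."""
    # Frame-level early fusion: concatenate modalities along the feature axis.
    fused = np.concatenate([audio_feats, visual_feats], axis=1)
    # Drop trailing frames that do not fill a complete window.
    n_windows = fused.shape[0] // window_size
    windows = fused[:n_windows * window_size].reshape(
        n_windows, window_size, -1)
    # Summarise each window by its mean feature vector.
    return windows.mean(axis=1)

# Example: 100 synchronised frames, 13 audio features (e.g. MFCCs,
# an assumed dimensionality) and 6 visual features.
audio = np.random.rand(100, 13)
visual = np.random.rand(100, 6)
fused = fuse_windows(audio, visual, window_size=25)
print(fused.shape)  # → (4, 19)
```

A dynamic variant, as the abstract suggests, would choose `window_size` per segment rather than fixing it in advance.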
Cite as: Huang, Y., Gilmartin, E., Campbell, N. (2017) Speaker Dependency Analysis, Audiovisual Fusion Cues and a Multimodal BLSTM for Conversational Engagement Recognition. Proc. Interspeech 2017, 3359-3363, doi: 10.21437/Interspeech.2017-1496
@inproceedings{huang17f_interspeech,
  author={Yuyun Huang and Emer Gilmartin and Nick Campbell},
  title={{Speaker Dependency Analysis, Audiovisual Fusion Cues and a Multimodal BLSTM for Conversational Engagement Recognition}},
  year=2017,
  booktitle={Proc. Interspeech 2017},
  pages={3359--3363},
  doi={10.21437/Interspeech.2017-1496}
}