Vision-based Active Speaker Detection in Multiparty Interaction

Kalin Stefanov, Jonas Beskow, Giampiero Salvi


This paper presents a supervised learning method for automatic visual detection of the active speaker in multiparty interactions. The detectors are built on a multimodal multiparty interaction dataset previously recorded to explore patterns in the focus of visual attention of humans. The dataset covers three conditions: two humans in task-based interaction with a robot; the same two humans in task-based interaction with the robot replaced by a third human; and free three-party human interaction. The paper also evaluates the active speaker detection method in a speaker-dependent experiment, showing that it achieves good accuracy in a fairly unconstrained scenario using only image data as input. The main goal of the method is to provide real-time detection of the active speaker within a broader framework implemented on a robot and used to generate natural focus of visual attention behavior during multiparty human-robot interactions.
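To make the vision-only setting concrete, the sketch below shows one common way such a detector can be framed: a frame-level binary classifier (speaking vs. not speaking) over cropped face images. The abstract does not specify the authors' actual features or architecture, so the small CNN, input size, and class layout here are illustrative assumptions only, not the paper's method.

```python
# Hypothetical sketch of vision-only active speaker detection as frame-level
# binary classification over face crops. Architecture and input size are
# assumptions for illustration; they are not taken from the paper.
import torch
import torch.nn as nn


class FaceSpeakingClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        # Small convolutional feature extractor over a single face crop.
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        # Two classes: 0 = not speaking, 1 = speaking.
        self.classifier = nn.Linear(64, 2)

    def forward(self, x):
        # x: batch of face crops, shape (N, 3, H, W).
        h = self.features(x).flatten(1)
        return self.classifier(h)


if __name__ == "__main__":
    model = FaceSpeakingClassifier()
    frames = torch.randn(8, 3, 64, 64)         # dummy batch of face crops
    labels = torch.randint(0, 2, (8,))         # per-frame speaking labels
    loss = nn.CrossEntropyLoss()(model(frames), labels)
    loss.backward()                            # one supervised step (optimizer omitted)
    print("loss:", loss.item())
```

In a multiparty setting, a detector of this kind would be applied per tracked face per frame, with the per-frame scores smoothed over time to decide who is currently speaking; those details are likewise outside what this abstract states.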


 DOI: 10.21437/GLU.2017-10

Cite as: Stefanov, K., Beskow, J., Salvi, G. (2017) Vision-based Active Speaker Detection in Multiparty Interaction. Proc. GLU 2017 International Workshop on Grounding Language Understanding, 47-51, DOI: 10.21437/GLU.2017-10.


@inproceedings{Stefanov2017,
  author={Kalin Stefanov and Jonas Beskow and Giampiero Salvi},
  title={Vision-based Active Speaker Detection in Multiparty Interaction},
  year=2017,
  booktitle={Proc. GLU 2017 International Workshop on Grounding Language Understanding},
  pages={47--51},
  doi={10.21437/GLU.2017-10},
  url={http://dx.doi.org/10.21437/GLU.2017-10}
}