ISCA Archive Interspeech 2009

Fusing audio and video information for online speaker diarization

Joerg Schmalenstroeer, Martin Kelling, Volker Leutnant, Reinhold Haeb-Umbach

In this paper we present a system for identifying and localizing speakers using distant microphone arrays and a steerable pan-tilt-zoom camera. Audio and video streams are processed in real time to obtain the diarization information “who speaks when and where” with low latency, for use in advanced video conferencing systems or user-adaptive interfaces. A key feature of the proposed system is to first glean information about the speaker’s location and identity from the audio and visual data streams separately, and then to fuse these data in a probabilistic framework employing the Viterbi algorithm. Here, visual evidence of a person is utilized through a priori state probabilities, while location and speaker change information enter via time-variant transition probabilities. Experiments show that video information yields a substantial improvement over purely audio-based diarization.
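As a rough illustration of the decoding step described above, the following sketch runs a Viterbi search over speaker states in which a visual prior initializes the state scores and the transition matrix may change at every frame. All names, array shapes, and the exact placement of the visual prior are assumptions made for illustration; the paper's actual model is not reproduced here.

import numpy as np

def viterbi_fusion(log_obs, log_trans, log_prior):
    # log_obs:   (T, S) per-frame log-likelihoods of the audio features
    #            under each of S speaker states
    # log_trans: (T-1, S, S) time-variant log transition probabilities,
    #            e.g. modulated by location and speaker-change cues
    # log_prior: (S,) a priori log state probabilities, e.g. derived
    #            from visual evidence of which persons are present
    T, S = log_obs.shape
    delta = np.empty((T, S))           # best partial-path scores
    psi = np.zeros((T, S), dtype=int)  # backpointers
    delta[0] = log_prior + log_obs[0]
    for t in range(1, T):
        # scores[i, j]: best path ending in state i, extended to state j
        scores = delta[t - 1][:, None] + log_trans[t - 1]
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_obs[t]
    # Backtrack the most likely speaker state sequence
    path = np.empty(T, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):
        path[t] = psi[t + 1, path[t + 1]]
    return path

Because the transition matrix is indexed by time, cues such as a detected speaker change or a shift in the estimated source location can lower or raise the probability of staying in the current speaker state at exactly the frames where the evidence occurs.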


doi: 10.21437/Interspeech.2009-338

Cite as: Schmalenstroeer, J., Kelling, M., Leutnant, V., Haeb-Umbach, R. (2009) Fusing audio and video information for online speaker diarization. Proc. Interspeech 2009, 1163-1166, doi: 10.21437/Interspeech.2009-338

@inproceedings{schmalenstroeer09_interspeech,
  author={Joerg Schmalenstroeer and Martin Kelling and Volker Leutnant and Reinhold Haeb-Umbach},
  title={{Fusing audio and video information for online speaker diarization}},
  year={2009},
  booktitle={Proc. Interspeech 2009},
  pages={1163--1166},
  doi={10.21437/Interspeech.2009-338}
}