In this paper we present a system for identifying and localizing speakers using distant microphone arrays and a steerable pan-tilt-zoom camera. Audio and video streams are processed in real time to obtain the diarization information “who speaks when and where” with low latency, for use in advanced video conferencing systems or user-adaptive interfaces. A key feature of the proposed system is that it first gleans information about the speaker’s location and identity from the audio and visual data streams separately and then fuses these cues in a probabilistic framework employing the Viterbi algorithm. Here, visual evidence of a person enters through a priori state probabilities, while location and speaker-change information enter via time-variant transition probabilities. Experiments show that the video information yields a substantial improvement over purely audio-based diarization.
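The fusion step described above can be illustrated with a minimal sketch of Viterbi decoding in which the initial state probabilities (here standing in for the visual evidence of a person) and per-step transition matrices (standing in for the location and speaker-change cues) are supplied externally and may vary over time. This is not the authors' implementation; all function and variable names are hypothetical, and the inputs are assumed to be given in the log domain.

```python
import numpy as np

def viterbi_time_variant(log_prior, log_trans, log_obs):
    """Viterbi decoding with time-variant transition probabilities.

    log_prior: (S,)        initial log-probabilities per speaker state
                           (e.g. derived from visual evidence)
    log_trans: (T-1, S, S) one log transition matrix per frame step
                           (e.g. modulated by speaker-change cues)
    log_obs:   (T, S)      per-frame log observation likelihoods
    Returns the most likely state (speaker) sequence of length T.
    """
    T, S = log_obs.shape
    delta = log_prior + log_obs[0]          # best score ending in each state
    psi = np.zeros((T, S), dtype=int)       # back-pointers
    for t in range(1, T):
        # scores[i, j]: best path ending in i at t-1, then moving i -> j
        scores = delta[:, None] + log_trans[t - 1]
        psi[t] = np.argmax(scores, axis=0)
        delta = scores[psi[t], np.arange(S)] + log_obs[t]
    # Backtrack from the best final state
    path = [int(np.argmax(delta))]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return path[::-1]
```

For example, with two speaker states, uniform transitions, and observations that favor state 0 in the first two frames and state 1 in the last, the decoder returns the sequence [0, 0, 1].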
Bibliographic reference. Schmalenstroeer, Joerg / Kelling, Martin / Leutnant, Volker / Haeb-Umbach, Reinhold (2009): "Fusing audio and video information for online speaker diarization", in Proc. INTERSPEECH 2009, pp. 1163-1166.