In this paper we investigate the problem of identifying and localizing speakers with distant microphone arrays, thus extending the classical speaker diarization task to answer the question "who spoke when and where". We consider a streaming audio scenario, where the diarization output is to be generated in real time with as low a latency as possible. Rather than carrying out the individual segmentation and classification tasks (speech detection, change detection, gender/speaker classification) sequentially, we propose simultaneous segmentation and classification by means of a Viterbi decoder. Instead of fixed transition probabilities, the decoder uses a transition matrix estimated online from position information and speaker change hypotheses. This avoids early hard decisions and is shown to outperform the sequential approach.
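As an illustration only, the idea of decoding with a transition matrix that changes from frame to frame can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names `transitions_from_change_prob` and `viterbi_time_varying` are hypothetical, the per-frame speaker change probability is taken as a given input (the abstract does not say how it is derived from the position information), and the change probability is spread uniformly over all other states.

```python
import numpy as np


def transitions_from_change_prob(p_change, n_states):
    """Build one log transition matrix per frame boundary (hypothetical helper).

    p_change[t] is an assumed probability that the active state changes
    between frame t and frame t + 1. Staying has probability 1 - p_change[t];
    the remaining mass is split evenly over the other states.
    """
    p = np.clip(np.asarray(p_change, dtype=float), 1e-6, 1.0 - 1e-6)
    trans = np.empty((len(p), n_states, n_states))
    trans[:] = (p / (n_states - 1))[:, None, None]   # off-diagonal entries
    idx = np.arange(n_states)
    trans[:, idx, idx] = (1.0 - p)[:, None]          # diagonal entries
    return np.log(trans)


def viterbi_time_varying(log_emit, log_trans, log_prior):
    """Viterbi decoding in the log domain with a separate transition
    matrix for every frame boundary.

    log_emit:  (T, S)      per-frame log-likelihood of each state
    log_trans: (T-1, S, S) log_trans[t, i, j] = log P(j at t+1 | i at t)
    log_prior: (S,)        log prior over the initial state
    """
    n_frames, n_states = log_emit.shape
    delta = log_prior + log_emit[0]
    backptr = np.zeros((n_frames, n_states), dtype=int)
    cols = np.arange(n_states)
    for t in range(1, n_frames):
        # scores[i, j]: best score ending in state i at t-1, then moving to j
        scores = delta[:, None] + log_trans[t - 1]
        backptr[t] = np.argmax(scores, axis=0)
        delta = scores[backptr[t], cols] + log_emit[t]
    # Backtrace from the best final state
    path = np.empty(n_frames, dtype=int)
    path[-1] = int(np.argmax(delta))
    for t in range(n_frames - 1, 0, -1):
        path[t - 1] = backptr[t, path[t]]
    return path


if __name__ == "__main__":
    # Toy data: two states, six frames, two ambiguous frames (t = 1, t = 4)
    log_emit = np.log(np.array([
        [0.9, 0.1], [0.4, 0.6], [0.9, 0.1],
        [0.1, 0.9], [0.6, 0.4], [0.1, 0.9],
    ]))
    # A change is considered likely only between frames 2 and 3
    log_trans = transitions_from_change_prob([0.01, 0.01, 0.9, 0.01, 0.01], 2)
    log_prior = np.log(np.array([0.5, 0.5]))
    print(viterbi_time_varying(log_emit, log_trans, log_prior))
```

In this toy setup the ambiguous frames are not decided in isolation: the low change probability around them keeps the decoded state sequence stable, while the high change probability at one boundary permits a switch there. The sketch backtraces over the whole sequence; a streaming system would presumably need some form of partial traceback to bound latency, which is not shown here.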
Bibliographic reference. Schmalenstroeer, Joerg / Haeb-Umbach, Reinhold (2007): "Joint speaker segmentation, localization and identification for streaming audio", In INTERSPEECH-2007, 570-573.