Odyssey 2010: The Speaker and Language Recognition Workshop

Brno, Czech Republic
28 June - 1 July 2010

Online Diarization of Telephone Conversations

Oshry Ben-Harush (1), Itshak Lapidot (2), Hugo Guterman (1)

(1) Ben-Gurion University, (2) Sami Shamoon College of Engineering

Speaker diarization systems attempt to segment and label a conversation between R speakers when no prior information about the conversation is given; in essence, they try to answer the question "Who spoke when?". Most state-of-the-art diarization systems operate in an off-line mode, that is, all samples of the audio stream are required before the diarization algorithm is applied. Off-line diarization algorithms generally rely on a dendrogram or hierarchical clustering approach. Several on-line diarization systems have been suggested previously; however, most require some prior information or off-line trained speaker and background models in order to conduct all or part of the diarization process. This study suggests a new two-stage algorithm for on-line diarization of telephone conversations. In the first stage, a fully unsupervised diarization algorithm is applied to an initial training set taken from the conversation. This stage generates the speaker and non-speech models and tunes a hyper-state Hidden Markov Model (HMM) to be used in the second, on-line stage of diarization. On-line diarization is then applied by means of time-series clustering using the Viterbi dynamic programming algorithm. This approach provides diarization results a few milliseconds after either a user request or the conclusion of the conversation. To evaluate diarization performance, the system was applied to 2048 two-speaker conversations, each 5 minutes long, extracted from the NIST 2005 Speaker Recognition Evaluation. The on-line Diarization Error Rate (DER) is shown to approach the "optimal" DER (achieved by applying unsupervised diarization to the entire conversation) as the length of the initial training set increases. Using an initial training set of 2 minutes and applying on-line diarization to the entire conversation incurred an increase of approximately 4% in DER compared to the "optimal" DER.
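
The Viterbi decoding step mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a plain three-state HMM (non-speech, speaker A, speaker B) rather than the hyper-state HMM of the paper, and the transition probabilities and per-frame log-likelihoods are made-up values standing in for the models trained in the first stage. The function name `viterbi` and all variable names are hypothetical.

```python
import math


def viterbi(log_emissions, log_trans, log_init):
    """Return the most likely state sequence for a sequence of frames.

    log_emissions[t][s]: log-likelihood of frame t under state s
    log_trans[i][j]:     log-probability of moving from state i to state j
    log_init[s]:         log-probability of starting in state s
    """
    n_states = len(log_init)
    # Best log-score of any path ending in each state at the current frame.
    score = [log_init[s] + log_emissions[0][s] for s in range(n_states)]
    backptr = []
    for frame in log_emissions[1:]:
        prev = score
        score, ptr = [], []
        for j in range(n_states):
            # Best predecessor state for landing in state j at this frame.
            best_i = max(range(n_states), key=lambda i: prev[i] + log_trans[i][j])
            ptr.append(best_i)
            score.append(prev[best_i] + log_trans[best_i][j] + frame[j])
        backptr.append(ptr)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for ptr in reversed(backptr):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path


# Assumed toy model: 0 = non-speech, 1 = speaker A, 2 = speaker B.
stay, switch = math.log(0.9), math.log(0.05)
log_trans = [[stay if i == j else switch for j in range(3)] for i in range(3)]
log_init = [math.log(1.0 / 3.0)] * 3

# Assumed per-frame log-likelihoods; frame 3 weakly favours speaker B.
log_emissions = [
    [-1.0, -5.0, -5.0],
    [-5.0, -1.0, -5.0],
    [-5.0, -1.0, -5.0],
    [-5.0, -2.5, -2.0],
    [-5.0, -1.0, -5.0],
    [-5.0, -5.0, -1.0],
    [-5.0, -5.0, -1.0],
]

labels = viterbi(log_emissions, log_trans, log_init)
print(labels)  # [0, 1, 1, 1, 1, 2, 2]
```

In this toy run, a frame-by-frame maximum would label frame 3 as speaker B, whereas the transition penalties keep it with speaker A. This smoothing over time is the reason for decoding a whole sequence rather than classifying each frame independently.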

Bibliographic reference. Ben-Harush, Oshry / Lapidot, Itshak / Guterman, Hugo (2010): "Online Diarization of Telephone Conversations", in Odyssey-2010, paper 023.