EUROSPEECH 2003 - INTERSPEECH 2003
In this paper, we report our recent work on speaker segmentation. Without a priori information about the number of speakers or their identities, the audio stream is segmented, and segments from the same speaker are grouped together. Speakers are represented by Gaussian Mixture Models (GMMs), and an HMM network is then used for segmentation. Unlike other model-based segmentation methods, however, the speaker GMMs are initialized using a simpler distance-based segmentation algorithm. To group segments of identical speakers, a two-level clustering mechanism is introduced, which we found to achieve higher accuracy than direct distance-based clustering methods. Our method significantly outperforms the best result reported at the 2002 Speaker Recognition Workshop. When tested on a set of professionally produced TV programs, our system yields a frame error rate of only 3.5%.
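The distance-based initialization mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify the distance measure or the features, so everything below is an assumption made for illustration: one-dimensional features, a single Gaussian fitted to each of two adjacent sliding windows, a symmetric Kullback-Leibler divergence as the distance, and hypothetical names and parameters (`symmetric_kl`, `detect_changes`, `win`, `threshold`). It is not the authors' implementation; in the system described, segments found this way would only serve to initialize the speaker GMMs before HMM-based resegmentation.

```python
import random
from statistics import fmean, pvariance


def symmetric_kl(a, b, var_floor=1e-6):
    """Symmetric KL divergence, KL(a||b) + KL(b||a), between 1-D Gaussians
    fitted to the samples a and b (an assumed distance, not the paper's)."""
    mu_a, mu_b = fmean(a), fmean(b)
    # Floor the variances so constant windows do not cause division by zero.
    var_a = max(pvariance(a, mu_a), var_floor)
    var_b = max(pvariance(b, mu_b), var_floor)
    d2 = (mu_a - mu_b) ** 2
    return (0.5 * (var_a / var_b + var_b / var_a - 2.0)
            + 0.5 * d2 * (1.0 / var_a + 1.0 / var_b))


def detect_changes(frames, win=50, threshold=5.0):
    """Return frame indices where the distance between the window before
    and the window after the index is a local maximum above threshold."""
    dist = {}
    for t in range(win, len(frames) - win + 1):
        dist[t] = symmetric_kl(frames[t - win:t], frames[t:t + win])

    changes = []
    for t in sorted(dist):
        d = dist[t]
        if d < threshold:
            continue
        # Keep t only if no position within +/- win has a larger distance.
        neighbours = [dist[u] for u in range(t - win, t + win + 1) if u in dist]
        if d >= max(neighbours):
            # Guard against exact ties producing two nearby boundaries.
            if not changes or t - changes[-1] >= win:
                changes.append(t)
    return changes


# Synthetic example: two "speakers" with different feature means,
# switching at frame 200 (made-up data, for illustration only).
rng = random.Random(0)
frames = ([rng.gauss(0.0, 1.0) for _ in range(200)]
          + [rng.gauss(5.0, 1.0) for _ in range(200)])
changes = detect_changes(frames)
print(changes)
```

Real systems typically operate on multi-dimensional cepstral features and may use other distances; the sliding-window, peak-picking structure is the part this sketch is meant to show.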
Bibliographic reference. Yu, Peng / Seide, Frank / Ma, Chengyuan / Chang, Eric (2003): "An improved model-based speaker segmentation system", In EUROSPEECH-2003, 2025-2028.