INTERSPEECH 2006 - ICSLP
The process of locating the end points of each speaker's voice in an audio file and then clustering segments based on speaker identity is called speaker segmentation. In this paper we present a method for two-speaker segmentation, though it can be extended to more than two speakers. Most methods for speaker segmentation and clustering start with an initial, computationally inexpensive speaker segmentation method, followed by a more accurate segment clustering. In this paper we describe a simple algorithm that improves the accuracy of the segment clustering without increasing the computational complexity. Since the clustering is done iteratively, the improvement in each segment clustering step results in a significant overall increase in segmentation accuracy and cluster purity. We borrow ideas from speaker recognition to perform segment clustering by frame-based voting. We look at each frame as an independent classifier deciding which speaker generated that segment. These 'classifiers' are combined by voting to decide which segments should be clustered together. This simple change leads to a 56.9% decrease in error rates on a segmentation task for the SWITCHBOARD corpus.
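The frame-based voting idea can be illustrated with a minimal sketch. This is not the authors' implementation: the one-dimensional Gaussian speaker models, the function names (`gaussian_loglik`, `vote_for_speaker`, `sum_loglik_speaker`) and the toy data are all assumptions made for illustration. The comparison with a summed log-likelihood score is likewise included only as a plausible contrast; the abstract does not say which baseline the authors used.

```python
import math

def gaussian_loglik(x, mean, var):
    """Log-likelihood of a scalar frame feature under a 1-D Gaussian (toy speaker model)."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def vote_for_speaker(frames, models):
    """Treat each frame as an independent classifier and combine them by majority vote.

    Returns (index of winning model, per-model vote counts).
    Ties are broken in favour of the lower index.
    """
    votes = [0] * len(models)
    for x in frames:
        scores = [gaussian_loglik(x, mean, var) for (mean, var) in models]
        votes[scores.index(max(scores))] += 1
    return votes.index(max(votes)), votes

def sum_loglik_speaker(frames, models):
    """Illustrative alternative: score the whole segment by its summed log-likelihood."""
    totals = [sum(gaussian_loglik(x, mean, var) for x in frames)
              for (mean, var) in models]
    return totals.index(max(totals)), totals

# Hypothetical models for two speakers, given as (mean, variance).
models = [(0.0, 1.0), (3.0, 1.0)]
# Hypothetical segment: four frames near speaker 0 and one outlier frame.
segment = [0.2, -0.1, 0.4, 0.1, 12.0]

winner_by_vote, votes = vote_for_speaker(segment, models)
winner_by_sum, _ = sum_loglik_speaker(segment, models)
print(winner_by_vote, votes, winner_by_sum)
```

In this toy case the single outlier frame dominates the summed score and assigns the segment to speaker 1, whereas under voting it counts as only one vote out of five and the segment goes to speaker 0. Whether this kind of robustness accounts for the gains reported in the paper is not stated in the abstract.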
Bibliographic reference. Narayanaswamy, Balakrishnan / Gangadharaiah, Rashmi / Stern, Richard M. (2006): "Voting for two speaker segmentation", In INTERSPEECH-2006, paper 1932-Wed3CaP.3.