Sixth European Conference on Speech Communication and Technology

Budapest, Hungary
September 5-9, 1999

Who Spoke When? - Automatic Segmentation and Clustering for Determining Speaker Turns

S. E. Johnson

Cambridge University Engineering Department, Cambridge, UK

The problem of labelling speaker turns by automatically segmenting and clustering a continuous audio stream is addressed. A new clustering scheme is presented and evaluated using a clustering efficiency score which treats both agglomerative and divisive clustering strategies equally. Results show an efficiency of 70% can be obtained on both manually and automatically derived segments on the 1996 Hub4 development data. For the task of identifying potentially unknown anchor speakers within broadcast news shows, the frame classification error rate is very important. To reflect this, a frame-based cluster efficiency is defined and the results show a 90% frame-based efficiency can be achieved. Finally, a frame-based comparison between the manually and automatically derived segment/cluster sets shows that approximately one third of the errors are introduced during segmentation and two thirds during clustering.

Bibliographic reference.  Johnson, S. E. (1999): "Who spoke when? - automatic segmentation and clustering for determining speaker turns", In EUROSPEECH'99, 2211-2214.