In this paper we describe the process of converting a research prototype system for speaker diarization into a fully deployed product running in real time and with low latency. The deployment is part of the IBM Cloud Speech-to-Text (STT) Service. First, the prototype system is described and the requirements for the on-line, deployable system are introduced. Then we describe the technical approaches we took to satisfy these requirements and discuss some of the challenges we faced. In particular, we present novel ideas for speeding up the system by using Automatic Speech Recognition (ASR) transcripts as an input to diarization, we introduce the concept of an active window to keep the computational complexity linear, we improve the speaker models with a new speaker-clustering algorithm, we automatically track the number of active speakers, and we enable users to set an operating point on a continuous scale between low latency and optimal accuracy. The deployed system has been tuned on real-life data, reaching average Speaker Error Rates around 3% and improving over the prototype system by about 10% relative.
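The active-window idea mentioned above can be illustrated with a minimal sketch. This is not the paper's algorithm; it only shows how bounding the set of segments considered for clustering keeps per-update cost constant and total cost linear in the audio length. The class name, the `max_segments` parameter, and the placeholder clustering step are all hypothetical.

```python
from collections import deque

class ActiveWindowDiarizer:
    """Toy illustration of an active window for on-line diarization.

    Only the most recent ``max_segments`` speech segments are retained,
    so the work done per incoming segment is bounded regardless of how
    long the audio stream runs.
    """

    def __init__(self, max_segments=50):
        # deque with maxlen drops the oldest segment automatically once
        # the window is full, which is what bounds the per-update cost.
        self.window = deque(maxlen=max_segments)

    def add_segment(self, embedding, speaker_hint=None):
        """Add one speech segment and return the current speaker count."""
        self.window.append((embedding, speaker_hint))
        return self._count_active_speakers()

    def _count_active_speakers(self):
        # Placeholder for real speaker clustering: here we simply count
        # distinct speaker hints inside the window, mimicking the idea
        # of tracking the number of currently active speakers.
        speakers = {hint for _, hint in self.window if hint is not None}
        return len(speakers)
```

A larger window trades latency and compute for accuracy, which mirrors the abstract's continuous operating point between low latency and optimal accuracy.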
Cite as: Dimitriadis, D., Fousek, P. (2017) Developing On-Line Speaker Diarization System. Proc. Interspeech 2017, 2739-2743, doi: 10.21437/Interspeech.2017-166
@inproceedings{dimitriadis17_interspeech,
  author    = {Dimitrios Dimitriadis and Petr Fousek},
  title     = {{Developing On-Line Speaker Diarization System}},
  year      = {2017},
  booktitle = {Proc. Interspeech 2017},
  pages     = {2739--2743},
  doi       = {10.21437/Interspeech.2017-166}
}