This paper presents Cisco's speaker segmentation and recognition (SSR) system, which is part of a commercial product. Cisco SSR combines speaker segmentation and speaker recognition algorithms with a crowdsourcing approach to create speaker metadata. This metadata makes enterprise videos more accessible and navigable, both on its own and in combination with other forms of metadata such as keywords. The paper illustrates the functional blocks of SSR and a typical user interface, describes the specific implementations of the speaker segmentation and recognition algorithms, and reports the evaluation data, protocols, and results for both tasks. Speaker segmentation results show that Cisco SSR performs comparably to the state of the art on RT-03F data. Speaker recognition results show that a small set of user-provided labels can be effectively transferred to a continuously expanding set of videos.
Cite as: Kajarekar, S., Khare, A., Paulik, M., Agrawal, N., Panchapagesan, P., Sankar, A., Gannu, S. (2012) Cisco's speaker segmentation and recognition system. Proc. The Speaker and Language Recognition Workshop (Odyssey 2012), 151-156
@inproceedings{kajarekar12_odyssey,
  author={Sashin Kajarekar and Aparna Khare and Matthias Paulik and Neha Agrawal and Panchi Panchapagesan and Ananth Sankar and Satish Gannu},
  title={{Cisco's speaker segmentation and recognition system}},
  year=2012,
  booktitle={Proc. The Speaker and Language Recognition Workshop (Odyssey 2012)},
  pages={151--156}
}