Speaker diarization of a collection of recordings with uniquely identified speakers is a challenging task. A system addressing such a task must account for the inter-session variability present from recording to recording, and it must scale well to massive amounts of data. In this paper we use a two-stage approach to corpus-wide speaker diarization, consisting of a speaker diarization stage followed by a speaker linking stage. The speaker linking system agglomeratively clusters speaker factor posterior distributions obtained via Joint Factor Analysis, using the Ward method with the Hotelling t-square statistic as the distance measure. We extend this framework to link speakers based on both speech and visual modalities to improve the robustness of the system. The system is evaluated on the data collected for the Augmented Multiparty Interaction (AMI) project, involving over one hundred meetings. We report within-recording and across-recording diarization error rates (DER) to support the effectiveness of multi-modal speaker linking in enabling large-scale speaker diarization.
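The speaker linking step can be illustrated with a minimal sketch. It is not the authors' implementation, and it makes several simplifying assumptions: each within-recording speaker is summarized by a sample count, a mean vector and a diagonal covariance (the abstract does not specify the covariance structure), the standard two-sample Hotelling t-square statistic with a pooled covariance serves as the distance, and clusters are merged greedily until a stopping threshold is exceeded, in place of the Ward linkage criterion. The names `Cluster`, `hotelling_t2`, `merge`, `link_speakers` and the `threshold` parameter are all hypothetical.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import List


@dataclass
class Cluster:
    """Summary of one speaker cluster (hypothetical container)."""
    n: int               # number of observations behind the estimate (n >= 2)
    mean: List[float]    # mean speaker-factor vector
    var: List[float]     # per-dimension sample variance (diagonal covariance assumed)
    members: List[str]   # labels of the within-recording speakers merged so far


def hotelling_t2(a: Cluster, b: Cluster) -> float:
    """Two-sample Hotelling T^2 with a pooled diagonal covariance."""
    dof = a.n + b.n - 2
    scale = a.n * b.n / (a.n + b.n)
    t2 = 0.0
    for ma, mb, va, vb in zip(a.mean, b.mean, a.var, b.var):
        pooled = ((a.n - 1) * va + (b.n - 1) * vb) / dof
        t2 += (ma - mb) ** 2 / pooled
    return scale * t2


def merge(a: Cluster, b: Cluster) -> Cluster:
    """Combine two clusters, updating count, mean and sample variance."""
    n = a.n + b.n
    w = a.n * b.n / n
    mean = [(a.n * ma + b.n * mb) / n for ma, mb in zip(a.mean, b.mean)]
    var = [((a.n - 1) * va + (b.n - 1) * vb + w * (ma - mb) ** 2) / (n - 1)
           for ma, mb, va, vb in zip(a.mean, b.mean, a.var, b.var)]
    return Cluster(n, mean, var, a.members + b.members)


def link_speakers(clusters: List[Cluster], threshold: float) -> List[Cluster]:
    """Greedy agglomerative linking: merge the closest pair until the
    smallest T^2 exceeds the (hypothetical) stopping threshold."""
    clusters = list(clusters)
    while len(clusters) > 1:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda p: hotelling_t2(clusters[p[0]], clusters[p[1]]))
        if hotelling_t2(clusters[i], clusters[j]) > threshold:
            break
        merged = merge(clusters[i], clusters[j])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters
```

In this sketch, each element of `members` would be a within-recording speaker label, so every cluster returned by `link_speakers` lists the per-recording speakers that are hypothesized to be the same person across the corpus.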
Bibliographic reference. Ferràs, Marc / Masneri, Stefano / Schreer, Oliver / Bourlard, Hervé (2014): "Diarizing large corpora using multi-modal speaker linking", In INTERSPEECH-2014, 602-606.