GTH-UPM System for Albayzin Multimodal Diarization Challenge 2020

Cristina Luna-Jiménez, Ricardo Kleinlein, Fernando Fernández-Martínez, José Manuel Pardo-Muñoz, José Manuel Moya-Fernández

This paper describes the multimodal diarization system submitted by the GTH-UPM team to the Albayzin Multimodal Diarization Challenge 2020. The submitted solution consists of two separate diarization systems, one for the visual component and one for the aural component.

The visual diarization solution exploits web resources as well as the provided enrollment images. First, these images are fed to a face detector. Next, all detected faces are passed through FaceNet to generate embeddings. We then apply a clustering algorithm to the extracted embeddings, obtaining a representative cluster for each participant. The centroid of each representative cluster acts as that participant's model. When a new embedding extracted from a facial image of the program arrives at the system, it receives the label of the closest centroid among all the given participants, as long as it exceeds a fixed quality threshold.
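The assignment step described above can be sketched as a nearest-centroid lookup with a rejection threshold. This is only an illustrative sketch, not the authors' implementation: the use of cosine similarity as the quality score, the threshold value of 0.5, the function names, and the toy 3-dimensional embeddings are all assumptions made for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def assign_identity(embedding, centroids, threshold=0.5):
    """Return the name of the most similar centroid, or None if the best
    similarity does not exceed the threshold (assumed quality criterion)."""
    best_name, best_score = None, -1.0
    for name, centroid in centroids.items():
        score = cosine_similarity(embedding, centroid)
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score > threshold else None


# Hypothetical participant models (centroids of each representative cluster).
centroids = {"alice": [1.0, 0.0, 0.0], "bob": [0.0, 1.0, 0.0]}

label_close = assign_identity([0.9, 0.1, 0.0], centroids)  # near "alice"
label_far = assign_identity([0.0, 0.0, 1.0], centroids)    # near nobody, rejected
```

Rejecting low-scoring embeddings keeps faces of people outside the participant list from being forced onto the nearest known identity.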

The aural speaker diarization problem is tackled as a classification task, in which a deep learning model learns the mapping between automatically extracted sequences of aural x-vectors and speaker identities. These sequences aid in overcoming the scarcity of training samples per speaker.
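One way in which sequences can yield more training samples from few recordings is overlapping windowing: a run of consecutive x-vectors from one speaker is cut into several shorter, overlapping sequences. The abstract does not say how the sequences were built, so the sliding-window scheme, the function name `make_sequences`, and the `seq_len` and `hop` values below are assumptions used purely for illustration.

```python
def make_sequences(xvectors, seq_len=3, hop=1):
    """Cut a list of consecutive x-vectors into overlapping fixed-length sequences.

    Returns an empty list when fewer than seq_len x-vectors are available.
    """
    last_start = len(xvectors) - seq_len
    return [xvectors[i:i + seq_len] for i in range(0, last_start + 1, hop)]


# Five toy 2-dimensional "x-vectors" for a single hypothetical speaker.
speaker_xvectors = [[0.1, 0.2], [0.2, 0.1], [0.3, 0.3], [0.1, 0.4], [0.5, 0.2]]

sequences = make_sequences(speaker_xvectors, seq_len=3, hop=1)
# Each element of `sequences` would be one training sample labeled with the speaker.
```

With a hop of 1, five x-vectors produce three training sequences instead of a single sample, at the cost of strong overlap between them.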

The best submitted results reached a DER of 66.94% for visual diarization and a DER of 125.24% for aural diarization on the test set.
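A DER above 100% is possible because, in the standard definition, the diarization error rate divides the sum of missed, false-alarm, and confused time by the total reference time only, so a system that produces enough false alarms can accumulate more error time than there is reference time. The sketch below illustrates that definition; the durations are invented for the example and are not taken from the paper.

```python
def diarization_error_rate(missed, false_alarm, confusion, reference_total):
    """Standard DER in percent: error time divided by total reference time.

    All arguments are durations in the same unit (e.g. seconds).
    """
    if reference_total <= 0:
        raise ValueError("reference_total must be positive")
    return 100.0 * (missed + false_alarm + confusion) / reference_total


# Invented durations: 100 s of reference speech, with heavy false alarms.
der = diarization_error_rate(
    missed=20.0, false_alarm=80.0, confusion=30.0, reference_total=100.0
)
# Error time (130 s) exceeds reference time (100 s), so DER is above 100%.
```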

doi: 10.21437/IberSPEECH.2021-15

Luna-Jiménez, C., Kleinlein, R., Fernández-Martínez, F., Pardo-Muñoz, J.M., Moya-Fernández, J.M. (2021) GTH-UPM System for Albayzin Multimodal Diarization Challenge 2020. Proc. IberSPEECH 2021, 71-75, doi: 10.21437/IberSPEECH.2021-15.