15th Annual Conference of the International Speech Communication Association

September 14-18, 2014

Multimodal Understanding for Person Recognition in Video Broadcasts

Frederic Bechet (1), Meriem Bendris (1), Delphine Charlet (2), Géraldine Damnati (2), Benoit Favre (1), Mickael Rouvier (1), Remi Auguste (3), Benjamin Bigot (4), Richard Dufour (4), Corinne Fredouille (4), Georges Linarès (4), Jean Martinet (3), Gregory Senay (4), Pierre Tirilly (3)

(1) LIF (UMR 7279), France
(2) Orange Labs, France
(3) LIFL, France
(4) LIA, France

This paper describes a multi-modal person recognition system for video broadcasts, developed to participate in the Defi-Repere challenge. The main track of this challenge targets the identification of all persons occurring in a video, in either the audio modality (speakers) or the image modality (faces). The system was developed by the PERCOL team, which involves 4 research labs in France, and was ranked first at the 2014 Defi-Repere challenge. The main scientific issue addressed by this challenge is how to combine audio and video information extraction processes to improve extraction performance in both modalities. In this paper, we present the strategy followed by the PERCOL team for speaker identification, which enriches the speaker diarization with features related to the “understanding” of the video scenes: text overlay transcription and analysis, automatic situation identification (TV set, report), the number of people visible, the TV set layout, and even the camera in use when available. Experiments on the REPERE corpus show interesting results for the speaker identification system enriched with the scene understanding features, and show the usefulness of speaker information for identifying faces.


Bibliographic reference. Bechet, Frederic / Bendris, Meriem / Charlet, Delphine / Damnati, Géraldine / Favre, Benoit / Rouvier, Mickael / Auguste, Remi / Bigot, Benjamin / Dufour, Richard / Fredouille, Corinne / Linarès, Georges / Martinet, Jean / Senay, Gregory / Tirilly, Pierre (2014): "Multimodal understanding for person recognition in video broadcasts", in INTERSPEECH-2014, 607-611.