Most existing datasets for speaker identification contain samples obtained
under quite constrained conditions, and are usually hand-annotated,
hence limited in size. The goal of this paper is to generate a large-scale
text-independent speaker identification dataset collected ‘in
the wild’.
We make two contributions. First, we propose a fully automated
pipeline based on computer vision techniques to create the dataset
from open-source media. Our pipeline involves obtaining videos from
YouTube; performing active speaker verification using a two-stream
synchronization Convolutional Neural Network (CNN); and confirming
the identity of the speaker using CNN-based facial recognition. We
use this pipeline to curate VoxCeleb, which contains hundreds of thousands
of ‘real world’ utterances for over 1,000 celebrities.
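To make the three-stage pipeline concrete, the following is a minimal Python sketch of the curation loop. The helper functions (download_candidate_videos, active_speaker_segments, face_matches_identity) are hypothetical stand-ins for the YouTube retrieval, two-stream synchronization CNN, and CNN-based face recognition stages; this is an illustration under those assumptions, not the authors' released code.

from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Utterance:
    video_id: str
    start: float    # segment start time (seconds)
    end: float      # segment end time (seconds)
    speaker: str    # verified celebrity identity

def download_candidate_videos(name: str) -> List[str]:
    """Hypothetical stand-in: return YouTube video IDs for a celebrity query."""
    return [f"{name}_video_0"]

def active_speaker_segments(video_id: str) -> List[Tuple[float, float]]:
    """Hypothetical stand-in for the two-stream synchronization CNN:
    return (start, end) spans where the visible face is the one speaking."""
    return [(0.0, 4.2)]

def face_matches_identity(video_id: str, span: Tuple[float, float], name: str) -> bool:
    """Hypothetical stand-in for the CNN-based face recognition check."""
    return True

def curate(names: List[str]) -> List[Utterance]:
    dataset = []
    for name in names:
        for vid in download_candidate_videos(name):
            for span in active_speaker_segments(vid):
                # Keep a segment only if the active speaker's face
                # matches the target celebrity identity.
                if face_matches_identity(vid, span, name):
                    dataset.append(Utterance(vid, span[0], span[1], name))
    return dataset

if __name__ == "__main__":
    print(curate(["some_celebrity"]))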
Our second contribution is to apply and compare various state-of-the-art
speaker identification techniques on our dataset to establish
baseline performance. We show that a CNN-based architecture obtains
the best performance for both identification and verification.
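As an illustration of how a verification baseline can be scored once a CNN has mapped each utterance to a fixed-dimensional embedding, the sketch below compares two utterance-level vectors with cosine similarity against a decision threshold. The embedding dimension and threshold here are assumptions for illustration, not values taken from the paper.

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two utterance embeddings."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def same_speaker(emb_a: np.ndarray, emb_b: np.ndarray, threshold: float = 0.5) -> bool:
    """Accept the verification trial if similarity exceeds a tuned threshold.
    The threshold is illustrative; in practice it is set on held-out trials."""
    return cosine_similarity(emb_a, emb_b) >= threshold

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a, b = rng.normal(size=1024), rng.normal(size=1024)
    print(cosine_similarity(a, b), same_speaker(a, b))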
Cite as: Nagrani, A., Chung, J.S., Zisserman, A. (2017) VoxCeleb: A Large-Scale Speaker Identification Dataset. Proc. Interspeech 2017, 2616-2620, doi: 10.21437/Interspeech.2017-950
@inproceedings{nagrani17_interspeech,
  author={Arsha Nagrani and Joon Son Chung and Andrew Zisserman},
  title={{VoxCeleb: A Large-Scale Speaker Identification Dataset}},
  year=2017,
  booktitle={Proc. Interspeech 2017},
  pages={2616--2620},
  doi={10.21437/Interspeech.2017-950}
}