AVA-Speech: A Densely Labeled Dataset of Speech Activity in Movies

Sourish Chaudhuri, Joseph Roth, Daniel P. W. Ellis, Andrew Gallagher, Liat Kaver, Radhika Marvin, Caroline Pantofaru, Nathan Reale, Loretta Guarino Reid, Kevin Wilson, Zhonghua Xi


Speech activity detection (or endpointing) is an important processing step for applications such as speech recognition, language identification and speaker diarization. Both audio- and vision-based approaches have been used for this task in various settings, often tailored toward end applications. However, much of the prior work reports results in synthetic settings, on task-specific datasets, or on datasets that are not openly available. This makes it difficult to compare approaches and understand their strengths and weaknesses. In this paper, we describe a new dataset, which we will release publicly, containing densely labeled speech activity in YouTube videos, with the goal of creating a shared, openly available dataset for this task. The labels in the dataset annotate three different speech activity conditions: clean speech, speech co-occurring with music and speech co-occurring with noise. These labels enable analysis of model performance in more challenging conditions based on the presence of overlapping noise. We report benchmark performance numbers on AVA-Speech using off-the-shelf, state-of-the-art audio and vision models that serve as a baseline to facilitate future research.


DOI: 10.21437/Interspeech.2018-2028

Cite as: Chaudhuri, S., Roth, J., Ellis, D.P.W., Gallagher, A., Kaver, L., Marvin, R., Pantofaru, C., Reale, N., Guarino Reid, L., Wilson, K., Xi, Z. (2018) AVA-Speech: A Densely Labeled Dataset of Speech Activity in Movies. Proc. Interspeech 2018, 1239-1243, DOI: 10.21437/Interspeech.2018-2028.


@inproceedings{Chaudhuri2018,
  author={Sourish Chaudhuri and Joseph Roth and Daniel P. W. Ellis and Andrew Gallagher and Liat Kaver and Radhika Marvin and Caroline Pantofaru and Nathan Reale and Loretta {Guarino Reid} and Kevin Wilson and Zhonghua Xi},
  title={AVA-Speech: A Densely Labeled Dataset of Speech Activity in Movies},
  year=2018,
  booktitle={Proc. Interspeech 2018},
  pages={1239--1243},
  doi={10.21437/Interspeech.2018-2028},
  url={http://dx.doi.org/10.21437/Interspeech.2018-2028}
}