Visually-grounded spoken language datasets can enable models to learn
cross-modal correspondences with very weak supervision. However, modern
audio-visual datasets contain biases that undermine the real-world
performance of models trained on them. We introduce Spoken ObjectNet,
which is designed to remove some of these biases and to provide a better
way to evaluate how models will perform in real-world scenarios. This
dataset expands upon ObjectNet, a bias-controlled image dataset with
image classes similar to those in ImageNet.
We detail our data collection pipeline, which features several
methods to improve caption quality, including automated language model
checks. Lastly, we show baseline results on image retrieval and audio
retrieval tasks. These results show that models trained on other datasets
and then evaluated on Spoken ObjectNet tend to perform poorly because they
have learned the biases present in those datasets. We also show evidence
that this performance drop is caused by the dataset controls rather than
by the transfer setting.
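
The abstract mentions automated language model checks as one of the methods used to improve caption quality, but does not spell out the implementation here. The following is only a minimal sketch of one way such a check could work, assuming a GPT-2 perplexity filter; the model choice, threshold, and function names are illustrative assumptions, not the paper's exact pipeline.

```python
# Illustrative sketch: flag caption transcripts that a language model finds
# implausible. GPT-2, the threshold, and the helper names are assumptions.
import math

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()


def caption_perplexity(caption: str) -> float:
    """Score a caption transcript by GPT-2 perplexity (lower = more fluent)."""
    inputs = tokenizer(caption, return_tensors="pt")
    with torch.no_grad():
        # With labels equal to input_ids, the model returns the mean
        # cross-entropy loss over the sequence.
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    return math.exp(loss.item())


def passes_quality_check(caption: str, max_perplexity: float = 500.0) -> bool:
    """Accept captions whose transcripts read as natural language."""
    return caption_perplexity(caption) <= max_perplexity


print(passes_quality_check("a wooden chair sitting next to a kitchen table"))
```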
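The image and audio retrieval baselines mentioned above are the standard cross-modal retrieval setting, which is commonly scored with recall@K. As a reference for the metric only, here is a small sketch assuming paired (N, D) embedding matrices and cosine similarity; the variable names and toy data are illustrative, not the paper's evaluation code.

```python
# Illustrative sketch of recall@K for audio-to-image retrieval; the paired
# (N, D) embedding layout and names are assumptions for this example.
import numpy as np


def recall_at_k(audio_emb: np.ndarray, image_emb: np.ndarray, k: int = 10) -> float:
    """Fraction of audio queries whose paired image ranks in the top k.

    audio_emb, image_emb: (N, D) arrays where row i of each forms a matched pair.
    """
    # Cosine similarity between every spoken caption and every image.
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    v = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    sims = a @ v.T  # sims[i, j] = similarity of audio i to image j

    # Rank images for each audio query and check where the true match lands.
    ranks = np.argsort(-sims, axis=1)
    hits = [i in ranks[i, :k] for i in range(len(ranks))]
    return float(np.mean(hits))


# Toy usage with loosely aligned random pairs.
rng = np.random.default_rng(0)
audio = rng.normal(size=(100, 512))
image = audio + 0.1 * rng.normal(size=(100, 512))
print(f"R@10 = {recall_at_k(audio, image, k=10):.2f}")
```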
Cite as: Palmer, I., Rouditchenko, A., Barbu, A., Katz, B., Glass, J. (2021) Spoken ObjectNet: A Bias-Controlled Spoken Caption Dataset. Proc. Interspeech 2021, 3650-3654, doi: 10.21437/Interspeech.2021-245
@inproceedings{palmer21_interspeech,
  author={Ian Palmer and Andrew Rouditchenko and Andrei Barbu and Boris Katz and James Glass},
  title={{Spoken ObjectNet: A Bias-Controlled Spoken Caption Dataset}},
  year=2021,
  booktitle={Proc. Interspeech 2021},
  pages={3650--3654},
  doi={10.21437/Interspeech.2021-245}
}