Towards Bilingual Lexicon Discovery From Visually Grounded Speech Audio

Emmanuel Azuh, David Harwath, James Glass


In this paper, we present a method for the discovery of word-like units and their approximate translations from visually grounded speech across multiple languages. We first train a neural network model to map images and their spoken audio captions in both English and Hindi to a shared, multimodal embedding space. Next, we use this model to segment and cluster regions of the spoken captions which approximately correspond to words. Finally, we exploit between-cluster similarities in the embedding space to associate English pseudo-word clusters with Hindi pseudo-word clusters, and show that many of these cluster pairings capture semantic translations between English and Hindi words. We present quantitative cross-lingual clustering results, as well as qualitative results in the form of a bilingual picture dictionary.
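The final step of the pipeline — associating pseudo-word clusters across languages via their similarity in the shared embedding space — can be sketched in a few lines. This is a hypothetical illustration, not the authors' code: `pair_clusters` and the toy centroids below are invented for exposition, and the paper's actual similarity measure and clustering procedure may differ.

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def pair_clusters(en_centroids, hi_centroids):
    """For each English pseudo-word cluster centroid, return the index
    of the most similar Hindi cluster centroid (hypothetical sketch)."""
    return [
        max(range(len(hi_centroids)),
            key=lambda j: cosine(e, hi_centroids[j]))
        for e in en_centroids
    ]

# Toy example: 3 English and 3 Hindi cluster centroids in a 4-d space.
en_c = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]]
hi_c = [[0, 0, 0.9, 0.1], [0.9, 0.1, 0, 0], [0, 1, 0, 0]]
print(pair_clusters(en_c, hi_c))  # → [1, 2, 0]
```

Each resulting (English cluster, Hindi cluster) pair is then a candidate bilingual lexicon entry; the paper evaluates how often such pairings capture genuine semantic translations.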


DOI: 10.21437/Interspeech.2019-1718

Cite as: Azuh, E., Harwath, D., Glass, J. (2019) Towards Bilingual Lexicon Discovery From Visually Grounded Speech Audio. Proc. Interspeech 2019, 276-280, DOI: 10.21437/Interspeech.2019-1718.


@inproceedings{Azuh2019,
  author={Emmanuel Azuh and David Harwath and James Glass},
  title={{Towards Bilingual Lexicon Discovery From Visually Grounded Speech Audio}},
  year={2019},
  booktitle={Proc. Interspeech 2019},
  pages={276--280},
  doi={10.21437/Interspeech.2019-1718},
  url={http://dx.doi.org/10.21437/Interspeech.2019-1718}
}