Multimodal Word Discovery and Retrieval with Phone Sequence and Image Concepts

Liming Wang, Mark A. Hasegawa-Johnson


This paper demonstrates three different systems capable of performing the multimodal word discovery task. A multimodal word discovery system accepts, as input, a database of spoken descriptions of images (or a set of corresponding phone transcripts), and learns a lexicon: a mapping from phone strings to their associated image concepts. Three systems are demonstrated: one based on a statistical machine translation (SMT) model, and two based on neural machine translation (NMT) models. On Flickr8k, the SMT-based model performs much better than the NMT-based ones, achieving a 49.6% F1 score. Finally, we apply our word discovery system to the task of image retrieval and achieve 29.1% recall@10 on the standard 1000-image Flickr8k test set.
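The SMT-based lexicon learner can be pictured as a word-alignment problem between phone tokens and image concepts. The sketch below uses IBM-Model-1-style EM on toy data; the specific model, data, and variable names here are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch: learn t(phone | concept) by EM, IBM Model 1 style.
# This is an assumption about how an SMT-based lexicon learner could work;
# the paper's exact alignment model may differ.
from collections import defaultdict

# Toy parallel corpus: (phone sequence of a caption, concepts in the image).
corpus = [
    (["k", "ae", "t"], ["cat"]),
    (["d", "ao", "g"], ["dog"]),
    (["k", "ae", "t", "d", "ao", "g"], ["cat", "dog"]),
]

def train_model1(corpus, n_iters=20):
    """Estimate translation probabilities t(phone | concept) with EM."""
    phones = {p for ph_seq, _ in corpus for p in ph_seq}
    # Uniform initialization over the phone inventory.
    t = defaultdict(lambda: 1.0 / len(phones))
    for _ in range(n_iters):
        count = defaultdict(float)  # expected phone-concept co-occurrences
        total = defaultdict(float)  # normalizer per concept
        # E-step: distribute each phone's probability mass over the
        # concepts present in its image.
        for ph_seq, concepts in corpus:
            for p in ph_seq:
                norm = sum(t[(p, c)] for c in concepts)
                for c in concepts:
                    frac = t[(p, c)] / norm
                    count[(p, c)] += frac
                    total[c] += frac
        # M-step: renormalize counts into probabilities.
        for (p, c) in count:
            t[(p, c)] = count[(p, c)] / total[c]
    return t

t = train_model1(corpus)
```

After training, phones from "cat" captions should receive higher probability under the concept "cat" than under "dog"; thresholding or taking the best concept per phone substring yields a phone-string-to-concept lexicon of the kind the paper evaluates.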


DOI: 10.21437/Interspeech.2019-1487

Cite as: Wang, L., Hasegawa-Johnson, M.A. (2019) Multimodal Word Discovery and Retrieval with Phone Sequence and Image Concepts. Proc. Interspeech 2019, 2683-2687, DOI: 10.21437/Interspeech.2019-1487.


@inproceedings{Wang2019,
  author={Liming Wang and Mark A. Hasegawa-Johnson},
  title={{Multimodal Word Discovery and Retrieval with Phone Sequence and Image Concepts}},
  year=2019,
  booktitle={Proc. Interspeech 2019},
  pages={2683--2687},
  doi={10.21437/Interspeech.2019-1487},
  url={http://dx.doi.org/10.21437/Interspeech.2019-1487}
}