Survey Talk: Multimodal Processing of Speech and Language

Florian Metze


Human information processing is inherently multimodal. Speech and language are therefore best processed and generated in a situated context. Future human language technologies must be able to jointly process multimodal data, rather than text, images, acoustics, or speech in isolation. Despite advances in Computer Vision, Automatic Speech Recognition, Multimedia Analysis, and Natural Language Processing, state-of-the-art computational models do not integrate multiple modalities anywhere near as effectively and efficiently as humans do. Researchers are only beginning to tackle these challenges in “vision and language” research. In this talk, I will show the potential of multimodal processing to (1) improve recognition in challenging conditions (e.g., lip-reading), (2) adapt models to new conditions (e.g., context or personalization), (3) ground semantics across modalities or languages (e.g., translation and language acquisition), (4) train models with weak or non-existent labels (e.g., SoundNet or bootstrapping of recognizers without parallel data), and (5) make models interpretable (e.g., representation learning). I will present and discuss significant recent research results from each of these areas and highlight their commonalities and differences. I hope to stimulate exchange and cross-fertilization of ideas by presenting not just abstract concepts, but by pointing the audience to new and existing tasks, datasets, and challenges.


Cite as: Metze, F. (2019) Survey Talk: Multimodal Processing of Speech and Language. Proc. Interspeech 2019.


@inproceedings{Metze2019,
  author={Florian Metze},
  title={{Survey Talk: Multimodal Processing of Speech and Language}},
  year={2019},
  booktitle={Proc. Interspeech 2019}
}