The primary use of speech is in face-to-face interactions, where situational context and human behavior intrinsically shape and affect communication. To usefully model situational awareness, machines must have access to the same streams of information that humans do. In other words, we need to provide machines with features that represent each communicative modality: face and gesture, voice and speech, and language. This paper presents OpenMM, an open-source multimodal feature extraction tool. We build upon existing open-source repositories to present the first publicly available tool for multimodal feature extraction. The tool provides a pipeline for researchers to easily extract visual and acoustic features. In addition, the tool performs automatic speech recognition (ASR) and uses the resulting transcripts to extract linguistic features. We evaluate OpenMM's multimodal feature set on deception, depression, and sentiment classification tasks and show that its performance is promising. This tool provides researchers with a simple way of extracting multimodal features and, consequently, a richer and more robust feature representation for machine learning tasks.
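As a rough illustration of the pipeline the abstract describes, the sketch below shows how visual, acoustic, and ASR-derived linguistic features might be fused into a single vector per recording. The function names and feature dimensions are illustrative placeholders, not OpenMM's actual interface; in the real tool each step would call the underlying open-source extractors.

import numpy as np

def extract_visual_features(video_path: str) -> np.ndarray:
    # Placeholder for frame-level face/gesture descriptors (assumed 20-dim).
    return np.zeros(20)

def extract_acoustic_features(audio_path: str) -> np.ndarray:
    # Placeholder for voice/speech descriptors computed from the audio (assumed 30-dim).
    return np.zeros(30)

def transcribe(audio_path: str) -> str:
    # Placeholder for the ASR step that produces a transcript.
    return ""

def extract_linguistic_features(transcript: str) -> np.ndarray:
    # Placeholder for linguistic features derived from the ASR transcript (assumed 10-dim).
    return np.zeros(10)

def multimodal_features(video_path: str, audio_path: str) -> np.ndarray:
    # Fuse the three modalities by simple feature-level concatenation.
    visual = extract_visual_features(video_path)
    acoustic = extract_acoustic_features(audio_path)
    linguistic = extract_linguistic_features(transcribe(audio_path))
    return np.concatenate([visual, acoustic, linguistic])

if __name__ == "__main__":
    features = multimodal_features("interview.mp4", "interview.wav")
    print(features.shape)  # one fused feature vector, ready for a downstream classifier

The fused vector would then feed a standard classifier for tasks such as the deception, depression, and sentiment experiments reported in the paper.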