16th Annual Conference of the International Speech Communication Association

Dresden, Germany
September 6-10, 2015

An Analysis of Time-Aggregated and Time-Series Features for Scoring Different Aspects of Multimodal Presentation Data

Vikram Ramanarayanan, Lei Chen, Chee Wee Leong, Gary Feng, David Suendermann-Oeft

Educational Testing Service, USA

We present a technique for automated assessment of public speaking and presentation proficiency based on the analysis of concurrently recorded speech and motion capture data. For the Kinect motion capture data, we examine both time-aggregated and time-series features. The former are statistical functionals of body-part position and/or velocity computed over the entire series. The latter, dubbed histograms of cooccurrences, capture how often different broad postural configurations co-occur within different time lags of each other over the evolution of the multimodal time series. We examine the relative utility of these features, along with curated features derived from the speech stream, in predicting human-rated scores of different aspects of public speaking and presentation proficiency. We further show that these features outperform the human inter-rater agreement baseline for a subset of the analyzed aspects.
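To make the two feature families described in the abstract concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes that body-part positions arrive as a (frames x dimensions) NumPy array and that each frame has already been assigned an integer "postural configuration" label by some clustering step that the abstract does not specify. The function names, the choice of functionals (mean, standard deviation, minimum, maximum), the frame rate, and the lag values are all hypothetical.

```python
import numpy as np


def time_aggregated_features(positions, frame_rate=30.0):
    """Statistical functionals of position and velocity over the whole series.

    positions: array of shape (T, D), one row per frame (assumed layout).
    Returns a 1-D vector: [mean, std, min, max] of position, then of velocity.
    """
    positions = np.asarray(positions, dtype=float)
    # Finite-difference velocity in units per second (frame_rate is an assumption).
    velocity = np.diff(positions, axis=0) * frame_rate
    feats = []
    for series in (positions, velocity):
        feats.extend([
            series.mean(axis=0),
            series.std(axis=0),
            series.min(axis=0),
            series.max(axis=0),
        ])
    return np.concatenate(feats)


def histogram_of_cooccurrences(labels, n_clusters, lags, normalize=False):
    """Count how often label i at frame t is followed by label j at frame t + lag.

    labels: length-T sequence of integer cluster labels in [0, n_clusters).
    lags: iterable of positive integer lags, in frames.
    Returns the per-lag (n_clusters x n_clusters) count matrices, flattened
    and concatenated into one feature vector.
    """
    labels = np.asarray(labels, dtype=int)
    hists = []
    for lag in lags:
        if lag < 1:
            raise ValueError("lags must be positive integers")
        counts = np.zeros((n_clusters, n_clusters))
        first, second = labels[:-lag], labels[lag:]
        # np.add.at accumulates correctly when the same (i, j) pair repeats.
        np.add.at(counts, (first, second), 1)
        if normalize and counts.sum() > 0:
            counts /= counts.sum()
        hists.append(counts.ravel())
    return np.concatenate(hists)


# Toy usage with made-up data: 4 frames, 2 hypothetical posture clusters.
toy_labels = [0, 1, 0, 1]
hoc = histogram_of_cooccurrences(toy_labels, n_clusters=2, lags=[1, 2])
```

In this sketch, each lag contributes one flattened co-occurrence matrix, so the feature length is `len(lags) * n_clusters ** 2` regardless of the length of the recording. That fixed length is what would let recordings of different durations be compared by a downstream scoring model.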

Bibliographic reference. Ramanarayanan, Vikram / Chen, Lei / Leong, Chee Wee / Feng, Gary / Suendermann-Oeft, David (2015): "An analysis of time-aggregated and time-series features for scoring different aspects of multimodal presentation data", In INTERSPEECH-2015, 1373-1377.