This paper describes a system for predicting discourse-role features based on voice-activity detection. It takes as input a vector of values extracted from conversational speech and predicts turn-taking activity and active-listening patterns using an echo-state network (ESN). We observed evidence of frame-attunement using a measure of speech density, defined as the ratio of speech to non-speech behaviour per utterance. We also noted synchrony in utterance timing and modelled it using the ESN. The system was trained on a subset of data from 100 telephone conversations from the 1,500-hour JST Expressive Speech Processing corpus, and predicts the interlocutor's timing behaviour with an error rate of less than 15% based on one partner's speech activity alone. An integrated system with access to content information would be expected to perform better.
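The speech-density measure mentioned above can be illustrated with a minimal sketch. It assumes that voice-activity detection yields one binary flag per frame (1 for speech, 0 for non-speech) and that density is the count of speech frames divided by the count of non-speech frames within an utterance. The function name `speech_density`, the frame representation, and the handling of utterances without non-speech frames are illustrative assumptions, not details taken from the paper.

```python
from typing import Sequence


def speech_density(vad_frames: Sequence[int]) -> float:
    """Ratio of speech frames to non-speech frames in one utterance.

    vad_frames is a hypothetical per-frame VAD output: truthy = speech.
    """
    speech = sum(1 for frame in vad_frames if frame)
    non_speech = len(vad_frames) - speech
    if non_speech == 0:
        # Assumption: an all-speech utterance has unbounded density,
        # and an empty utterance has density 0.
        return float("inf") if speech else 0.0
    return speech / non_speech


# Example utterance: 5 speech frames and 3 non-speech frames.
utterance = [0, 1, 1, 1, 0, 1, 1, 0]
density = speech_density(utterance)  # 5 / 3
```

A per-utterance value of this kind could then serve as one element of the input vector for a predictor; how the original system actually computes and encodes it is not specified here.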
Bibliographic reference. Campbell, Nick / Scherer, Stefan (2010): "Comparing measures of synchrony and alignment in dialogue speech timing with respect to turn-taking activity", In INTERSPEECH-2010, 2546-2549.