Enhancing Backchannel Prediction Using Word Embeddings

Robin Ruede, Markus Müller, Sebastian Stüker, Alex Waibel


Backchannel responses like “uh-huh”, “yeah”, and “right” are used by the listener in a social dialog to provide feedback to the speaker. In the context of human-computer interaction, an artificial agent can use such responses to build rapport in conversations with users. In the past, multiple approaches have been proposed to detect backchannel cues and to predict the most natural timing for placing backchannel utterances. Most of these are based on manually optimized fixed rules, which may fail to generalize; many systems rely on the location and duration of pauses and on pitch slopes of specific lengths. We previously proposed training artificial neural networks on acoustic features such as pitch and power, and also attempted to add word embeddings via word2vec. In this work, we refine that approach by evaluating different methods of adding timed word2vec embeddings. Comparing the performance of various feature combinations, we show that adding linguistic features improves performance over a prediction system that uses only acoustic features.
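The core idea of combining acoustic and linguistic inputs can be sketched as follows. This is a toy illustration, not the authors' architecture: the feature dimensions, the random embedding table standing in for trained word2vec vectors, and the untrained feedforward scorer are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy embedding table standing in for trained word2vec vectors (assumption:
# the real system would look up timed embeddings from a trained model).
EMB_DIM = 8
vocab = {"yeah": 0, "so": 1, "then": 2, "uh-huh": 3}
embeddings = rng.normal(size=(len(vocab), EMB_DIM))

N_FRAMES = 10  # hypothetical context window of 10 acoustic frames

def make_features(pitch, power, last_word):
    """Concatenate framewise acoustic features (pitch, power) with the
    embedding of the most recent word, as described in the abstract."""
    acoustic = np.concatenate([pitch, power])    # 2 * N_FRAMES values
    linguistic = embeddings[vocab[last_word]]    # EMB_DIM-dim word vector
    return np.concatenate([acoustic, linguistic])

# Minimal feedforward scorer with random (untrained) weights; a real
# predictor would be trained on dialog data.
IN_DIM = 2 * N_FRAMES + EMB_DIM
W1 = rng.normal(size=(IN_DIM, 16)) * 0.1
W2 = rng.normal(size=(16, 1)) * 0.1

def backchannel_score(x):
    h = np.tanh(x @ W1)
    z = h @ W2
    return 1.0 / (1.0 + np.exp(-z.item()))  # probability-like score in (0, 1)

pitch = rng.normal(size=N_FRAMES)  # stand-in pitch track
power = rng.normal(size=N_FRAMES)  # stand-in power track
x = make_features(pitch, power, "so")
score = backchannel_score(x)
```

The prediction system that uses only acoustic features corresponds to dropping the `linguistic` part of the concatenation; the paper's comparison is between these two input configurations.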


DOI: 10.21437/Interspeech.2017-1606

Cite as: Ruede, R., Müller, M., Stüker, S., Waibel, A. (2017) Enhancing Backchannel Prediction Using Word Embeddings. Proc. Interspeech 2017, 879-883, DOI: 10.21437/Interspeech.2017-1606.


@inproceedings{Ruede2017,
  author={Robin Ruede and Markus Müller and Sebastian Stüker and Alex Waibel},
  title={Enhancing Backchannel Prediction Using Word Embeddings},
  year=2017,
  booktitle={Proc. Interspeech 2017},
  pages={879--883},
  doi={10.21437/Interspeech.2017-1606},
  url={http://dx.doi.org/10.21437/Interspeech.2017-1606}
}