Turn-Taking Estimation Model Based on Joint Embedding of Lexical and Prosodic Contents

Chaoran Liu, Carlos Ishi, Hiroshi Ishiguro


A natural conversation involves rapid exchanges of turns between speakers. Taking turns at the appropriate time is therefore a requisite capability for a dialog system that acts as a conversation partner. This paper proposes a model that estimates the timing of turn-taking during verbal interactions. Unlike previous studies, our proposed model does not rely on a silence region between sentences, since a dialog system must respond without large gaps or overlaps. We propose a Recurrent Neural Network (RNN) based model that takes the joint embedding of lexical and prosodic contents as its input to classify utterances into turn-taking related classes and to estimate the turn-taking timing. To this end, we trained a neural network to embed the lexical contents, the fundamental frequencies, and the speech power into a joint embedding space. To learn meaningful embedding spaces, the prosodic features of each utterance are embedded by a pre-trained RNN and combined with the utterance's lexical embedding to form the input of our proposed model. We tested this model on a spontaneous conversation dataset and confirmed that it outperformed the use of word-embedding-based features.
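As a rough illustration of the joint-embedding idea described above, the following minimal sketch encodes a sequence of prosodic frames (fundamental frequency and power) with a plain RNN, concatenates the result with a mean-pooled word embedding, and scores the utterance against a set of classes. It is not the authors' implementation: the dimensions, the number of classes, the mean pooling, the single linear output layer, and every name (`rnn_encode`, `classify_utterance`, `params`) are assumptions made for illustration, and the weights are random rather than trained.

```python
import math
import random


def rnn_encode(frames, W_in, W_rec, b):
    """Run a plain (Elman) RNN over the frames and return the final hidden state."""
    h = [0.0] * len(b)
    for x in frames:
        h = [
            math.tanh(
                sum(W_in[i][j] * x[j] for j in range(len(x)))
                + sum(W_rec[i][k] * h[k] for k in range(len(h)))
                + b[i]
            )
            for i in range(len(b))
        ]
    return h


def mean_pool(vectors):
    """Average a list of equal-length vectors (assumed lexical pooling)."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]


def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def rand_matrix(rng, rows, cols, scale=0.1):
    """Random matrix standing in for trained weights."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def classify_utterance(word_vectors, prosody_frames, params):
    """Build the joint embedding and return it with class probabilities."""
    lexical = mean_pool(word_vectors)
    prosodic = rnn_encode(
        prosody_frames, params["W_in"], params["W_rec"], params["b"]
    )
    joint = lexical + prosodic  # list concatenation: [lexical ; prosodic]
    logits = [
        sum(w * x for w, x in zip(row, joint)) + c
        for row, c in zip(params["W_out"], params["b_out"])
    ]
    return joint, softmax(logits)


# Hypothetical sizes: 4-dim word vectors, 3 hidden units, 2 classes.
WORD_DIM, HIDDEN, N_CLASSES, FRAME_DIM = 4, 3, 2, 2
rng = random.Random(0)
params = {
    "W_in": rand_matrix(rng, HIDDEN, FRAME_DIM),
    "W_rec": rand_matrix(rng, HIDDEN, HIDDEN),
    "b": [0.0] * HIDDEN,
    "W_out": rand_matrix(rng, N_CLASSES, WORD_DIM + HIDDEN),
    "b_out": [0.0] * N_CLASSES,
}

# Toy utterance: 5 words and 20 prosodic frames of (F0, power), all synthetic.
word_vectors = rand_matrix(rng, 5, WORD_DIM, scale=1.0)
prosody_frames = rand_matrix(rng, 20, FRAME_DIM, scale=1.0)

joint, probs = classify_utterance(word_vectors, prosody_frames, params)
```

Here `joint` has `WORD_DIM + HIDDEN` entries and `probs` sums to one over the assumed classes. In the model described above the joint embeddings are fed to an RNN-based classifier; the single linear layer is used here only to keep the sketch short.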


DOI: 10.21437/Interspeech.2017-965

Cite as: Liu, C., Ishi, C., Ishiguro, H. (2017) Turn-Taking Estimation Model Based on Joint Embedding of Lexical and Prosodic Contents. Proc. Interspeech 2017, 1686-1690, DOI: 10.21437/Interspeech.2017-965.


@inproceedings{Liu2017,
  author={Chaoran Liu and Carlos Ishi and Hiroshi Ishiguro},
  title={Turn-Taking Estimation Model Based on Joint Embedding of Lexical and Prosodic Contents},
  year=2017,
  booktitle={Proc. Interspeech 2017},
  pages={1686--1690},
  doi={10.21437/Interspeech.2017-965},
  url={http://dx.doi.org/10.21437/Interspeech.2017-965}
}