Social Signal Detection in Spontaneous Dialogue Using Bidirectional LSTM-CTC

Hirofumi Inaguma, Koji Inoue, Masato Mimura, Tatsuya Kawahara


Non-verbal speech cues such as laughter and fillers, collectively called social signals, play an important role in human communication. Detecting them would therefore help dialogue systems infer a speaker's intentions, emotions, and engagement. Conventional approaches are based on frame-wise classifiers, which require precise time alignment of these events for training. This work investigates the Connectionist Temporal Classification (CTC) approach, which can learn an alignment between the input and its target label sequence. This allows robust detection of the events and efficient training without precise time information. Experimental evaluations with various settings demonstrate that CTC based on bidirectional LSTM outperforms conventional DNN- and HMM-based methods.
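
To illustrate why CTC needs no frame-level timing, the following minimal sketch shows the standard CTC collapsing rule (merge repeated labels, then drop blanks) and counts by brute force how many frame-level paths map to one target sequence. It is not the authors' implementation: the blank symbol "_", the event labels "L" (laughter) and "F" (filler), and the function names are assumptions made for this example.

```python
from itertools import groupby, product

BLANK = "_"  # assumed symbol for the CTC blank label (illustrative only)


def ctc_collapse(path):
    """Map a frame-level path to a label sequence: merge repeats, then drop blanks."""
    merged = [label for label, _ in groupby(path)]
    return [label for label in merged if label != BLANK]


def count_alignments(target, num_frames, alphabet):
    """Count, by brute force, the frame-level paths that collapse to `target`."""
    symbols = list(alphabet) + [BLANK]
    return sum(
        1
        for path in product(symbols, repeat=num_frames)
        if ctc_collapse(path) == list(target)
    )


if __name__ == "__main__":
    # Hypothetical event labels: "L" = laughter, "F" = filler.
    frames = ["_", "L", "L", "_", "_", "F", "_"]
    print(ctc_collapse(frames))
    # Number of 4-frame paths that all yield the same two-event sequence.
    print(count_alignments(["L", "F"], 4, ["L", "F"]))
```

Because every such path yields the same label sequence, training can sum over all of them rather than committing to one hand-labeled alignment. A real system would compute that sum with the CTC forward-backward recursion instead of enumeration, which is exponential in the number of frames.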


DOI: 10.21437/Interspeech.2017-457

Cite as: Inaguma, H., Inoue, K., Mimura, M., Kawahara, T. (2017) Social Signal Detection in Spontaneous Dialogue Using Bidirectional LSTM-CTC. Proc. Interspeech 2017, 1691-1695, DOI: 10.21437/Interspeech.2017-457.


@inproceedings{Inaguma2017,
  author={Hirofumi Inaguma and Koji Inoue and Masato Mimura and Tatsuya Kawahara},
  title={Social Signal Detection in Spontaneous Dialogue Using Bidirectional LSTM-CTC},
  year=2017,
  booktitle={Proc. Interspeech 2017},
  pages={1691--1695},
  doi={10.21437/Interspeech.2017-457},
  url={http://dx.doi.org/10.21437/Interspeech.2017-457}
}