Speech and Text Analysis for Multimodal Addressee Detection in Human-Human-Computer Interaction

Oleg Akhtiamov, Maxim Sidorov, Alexey A. Karpov, Wolfgang Minker


The necessity of addressee detection arises in multiparty spoken dialogue systems that deal with human-human-computer interaction. To cope with this kind of interaction, such a system must determine whether the user is addressing the system or another human. The present study focuses on multimodal addressee detection and describes three levels of speech and text analysis: acoustical, syntactical, and lexical. We define the connection between the different levels of analysis and the classification performance for different categories of speech, and we determine how addressee detection performance depends on speech recognition accuracy. We also compare the obtained results with those of the original research performed by the authors of the Smart Video Corpus, which we use in our computations. Our most effective meta-classifier, which works with acoustical, syntactical, and lexical features, reaches an unweighted average recall (UAR) of 0.917, an almost nine percent advantage over the best baseline model, even though that baseline classifier additionally uses head-orientation data. We also propose a universal meta-model based on acoustical and syntactical analysis that may, in principle, be applied in different domains.
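The unweighted average recall reported above is the mean of the per-class recalls, so minority and majority addressee classes count equally. A minimal sketch of how this metric is computed (the labels and data below are illustrative, not taken from the Smart Video Corpus):

```python
# Unweighted average recall (UAR): mean of per-class recalls.
# Unlike plain accuracy, a rare class (e.g. computer-directed speech)
# weighs as much as a frequent one (e.g. human-directed speech).

from collections import defaultdict

def unweighted_average_recall(y_true, y_pred):
    correct = defaultdict(int)  # per-class count of correct predictions
    total = defaultdict(int)    # per-class count of true instances
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)

# Toy two-class example (hypothetical labels):
y_true = ["computer", "computer", "human", "human", "human", "human"]
y_pred = ["computer", "human",    "human", "human", "human", "computer"]
print(unweighted_average_recall(y_true, y_pred))  # (0.5 + 0.75) / 2 = 0.625
```

This is equivalent to macro-averaged recall as offered by common ML toolkits.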


DOI: 10.21437/Interspeech.2017-501

Cite as: Akhtiamov, O., Sidorov, M., Karpov, A.A., Minker, W. (2017) Speech and Text Analysis for Multimodal Addressee Detection in Human-Human-Computer Interaction. Proc. Interspeech 2017, 2521-2525, DOI: 10.21437/Interspeech.2017-501.


@inproceedings{Akhtiamov2017,
  author={Oleg Akhtiamov and Maxim Sidorov and Alexey A. Karpov and Wolfgang Minker},
  title={Speech and Text Analysis for Multimodal Addressee Detection in Human-Human-Computer Interaction},
  year={2017},
  booktitle={Proc. Interspeech 2017},
  pages={2521--2525},
  doi={10.21437/Interspeech.2017-501},
  url={http://dx.doi.org/10.21437/Interspeech.2017-501}
}