A Study for Improving Device-Directed Speech Detection Toward Frictionless Human-Machine Interaction

Che-Wei Huang, Roland Maas, Sri Harish Mallidi, Björn Hoffmeister


In this paper, we extend our previous work on device-directed utterance detection, which aims to distinguish voice queries intended for a smart-home device from background speech. We phrase the task as a binary utterance-level classification problem and approach it with a DNN-LSTM model that takes acoustic features and features from the automatic speech recognition (ASR) decoder as input. In this work, we study the performance of the model across dialog types and across categories of decoder features. To address different dialog types, we found that a model with a separate output branch for each dialog type outperforms a model with a shared output branch, yielding a relative equal error rate (EER) reduction of 12.5%. We also found the average number of arcs in a confusion network to be one of the most informative ASR decoder features. In addition, we explored backpropagating the error every k frames (k = 1, 3, 5, 7) when training the acoustic embedding, and compared mean pooling and attention pooling for generating an utterance-level representation. We found that attention pooling provides the most discriminative utterance representation, outperforming mean pooling by a relative EER reduction of 4.97%.
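
The abstract is the only technical description on this page, so the sketches below are illustrative readings of it rather than the authors' implementation. First, the per-dialog-type output branches: a minimal PyTorch sketch in which a shared LSTM encoder feeds one binary output head per dialog type, and each utterance is scored by the head matching its dialog type. The class name, dimensions, and the mean pooling used here are assumptions.

import torch
import torch.nn as nn

class BranchedDirectednessModel(nn.Module):
    """Shared LSTM encoder with one binary output branch per dialog type
    (illustrative sketch, not the authors' code)."""
    def __init__(self, input_dim, hidden_dim, num_dialog_types):
        super().__init__()
        self.encoder = nn.LSTM(input_dim, hidden_dim, batch_first=True)
        self.branches = nn.ModuleList(
            [nn.Linear(hidden_dim, 1) for _ in range(num_dialog_types)])

    def forward(self, feats, dialog_type):
        # feats: (batch, time, input_dim); dialog_type: (batch,) int64
        out, _ = self.encoder(feats)
        utt = out.mean(dim=1)  # placeholder pooling; see the attention sketch below
        logits = torch.cat([branch(utt) for branch in self.branches], dim=1)
        # pick the logit from the branch matching each utterance's dialog type
        return logits.gather(1, dialog_type.unsqueeze(1)).squeeze(1)

A shared-branch baseline would replace the ModuleList with a single nn.Linear applied to every utterance regardless of dialog type.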
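
The decoder feature the abstract singles out, the average number of arcs in a confusion network, is simple to compute once the ASR decoder exposes the confusion network. A hedged sketch, assuming the network arrives as a list of slots, each slot a list of (word, posterior) arcs:

def avg_arcs_per_slot(confusion_network):
    """Average number of word arcs per confusion-network slot; more arcs
    per slot reflect greater decoder uncertainty about the utterance."""
    if not confusion_network:
        return 0.0
    return sum(len(slot) for slot in confusion_network) / len(confusion_network)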
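
The abstract does not spell out mechanically what backpropagating every k frames means for training the acoustic embedding; one plausible reading, sketched below, is to attach the frame-level training signal only at every k-th frame of the LSTM output, so gradients originate from one frame in k. Treat this as an assumption, not the paper's definition.

import torch
import torch.nn.functional as F

def frame_loss_every_k(frame_logits, frame_labels, k):
    """Binary cross-entropy computed only on every k-th frame (one
    possible reading of backprop every k frames; k = 1 recovers the
    dense per-frame objective)."""
    return F.binary_cross_entropy_with_logits(
        frame_logits[:, ::k], frame_labels[:, ::k])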
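
Finally, the pooling comparison: mean pooling averages the frame vectors uniformly, while attention pooling learns per-frame weights before summing. A minimal sketch, assuming a learned linear scorer with a softmax over time; the paper may use a different attention parameterization.

import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Scores each frame, softmax-normalizes the scores over time, and
    returns the weighted sum as the utterance representation."""
    def __init__(self, hidden_dim):
        super().__init__()
        self.score = nn.Linear(hidden_dim, 1, bias=False)

    def forward(self, frames):  # frames: (batch, time, hidden_dim)
        weights = torch.softmax(self.score(frames), dim=1)  # (batch, time, 1)
        return (weights * frames).sum(dim=1)                # (batch, hidden_dim)

def mean_pooling(frames):
    """Uniform average over frames, the baseline attention pooling beats."""
    return frames.mean(dim=1)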


 DOI: 10.21437/Interspeech.2019-2840

Cite as: Huang, C., Maas, R., Mallidi, S.H., Hoffmeister, B. (2019) A Study for Improving Device-Directed Speech Detection Toward Frictionless Human-Machine Interaction. Proc. Interspeech 2019, 3342-3346, DOI: 10.21437/Interspeech.2019-2840.


@inproceedings{Huang2019,
  author={Che-Wei Huang and Roland Maas and Sri Harish Mallidi and Björn Hoffmeister},
  title={{A Study for Improving Device-Directed Speech Detection Toward Frictionless Human-Machine Interaction}},
  year={2019},
  booktitle={Proc. Interspeech 2019},
  pages={3342--3346},
  doi={10.21437/Interspeech.2019-2840},
  url={http://dx.doi.org/10.21437/Interspeech.2019-2840}
}