Simultaneous Detection and Localization of a Wake-Up Word Using Multi-Task Learning of the Duration and Endpoint

Takashi Maekaku, Yusuke Kida, Akihiko Sugiyama


This paper proposes a novel method for simultaneous detection and localization of a wake-up word using multi-task learning of the duration and endpoint. The onset of the wake-up word is estimated by going back in time from the estimated endpoint by the estimated duration of the wake-up word. Accurate endpoint estimation is achieved by training the network to fire only at the endpoint rather than over the entire wake-up word. An accurate endpoint naturally leads to an accurate onset, because the onset is calculated from it using an estimated duration that reflects the acoustic information over the entire wake-up word. Experimental results with real-environment data show relative improvements in accuracy of 41% for onset estimation and 38% for endpoint estimation compared to a baseline method.
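
The onset calculation described in the abstract (endpoint minus duration) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `localize_wake_word`, the detection threshold, the 10 ms frame shift, the peak-picking rule, and the assumption that the duration output is expressed in frames are all hypothetical choices made for illustration.

```python
import numpy as np


def localize_wake_word(endpoint_scores, duration_frames,
                       threshold=0.5, frame_shift_s=0.01):
    """Return (onset_s, endpoint_s) in seconds, or None if nothing is detected.

    endpoint_scores: per-frame scores from a (hypothetical) endpoint output,
        assumed to peak only near the end of the wake-up word.
    duration_frames: per-frame duration estimates in frames from a
        (hypothetical) duration output of the same network.
    threshold, frame_shift_s: illustrative values, not taken from the paper.
    """
    endpoint_scores = np.asarray(endpoint_scores, dtype=float)
    duration_frames = np.asarray(duration_frames, dtype=float)

    # Detection: take the highest-scoring frame as the endpoint candidate.
    end_idx = int(np.argmax(endpoint_scores))
    if endpoint_scores[end_idx] < threshold:
        return None  # no wake-up word detected

    # Localization: go back in time from the endpoint by the estimated duration.
    onset_idx = max(0.0, end_idx - duration_frames[end_idx])
    return onset_idx * frame_shift_s, end_idx * frame_shift_s


# Toy input: one endpoint peak at frame 150, estimated duration of 80 frames.
scores = np.zeros(200)
scores[150] = 0.9
durations = np.full(200, 80.0)
result = localize_wake_word(scores, durations)
```

With the toy input above, the endpoint falls at frame 150 and the onset at frame 70, which corresponds to roughly 1.5 s and 0.7 s under the assumed 10 ms frame shift.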


DOI: 10.21437/Interspeech.2019-1180

Cite as: Maekaku, T., Kida, Y., Sugiyama, A. (2019) Simultaneous Detection and Localization of a Wake-Up Word Using Multi-Task Learning of the Duration and Endpoint. Proc. Interspeech 2019, 4240-4244, DOI: 10.21437/Interspeech.2019-1180.


@inproceedings{Maekaku2019,
  author={Takashi Maekaku and Yusuke Kida and Akihiko Sugiyama},
  title={{Simultaneous Detection and Localization of a Wake-Up Word Using Multi-Task Learning of the Duration and Endpoint}},
  year=2019,
  booktitle={Proc. Interspeech 2019},
  pages={4240--4244},
  doi={10.21437/Interspeech.2019-1180},
  url={http://dx.doi.org/10.21437/Interspeech.2019-1180}
}