Anchored Speech Detection

Roland Maas, Sree Hari Krishnan Parthasarathi, Brian King, Ruitong Huang, Björn Hoffmeister


We propose two new methods of speech detection in the context of voice-controlled far-field appliances. While conventional detection methods are designed to differentiate between speech and nonspeech, we aim to distinguish desired speech, which we define as speech originating from the person interacting with the device, from background noise and interfering talkers. Both proposed methods use the first word spoken by the desired talker, the "anchor" word, as a reference to learn characteristics of that speaker. In the first method, we estimate the mean of the anchor word segment and subtract it from the subsequent feature vectors. In the second, we use an encoder-decoder network with features that are normalized by conventional causal log-amplitude mean subtraction. The experimental results reveal that both techniques achieve around 10% relative reduction in frame classification error rate over a baseline feed-forward network with conventionally normalized features.
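The first method described in the abstract, subtracting the anchor-word segment mean from subsequent feature vectors, can be sketched as follows. This is an illustrative reading of the abstract only: the feature type (e.g. log filterbank energies), array shapes, and function names are assumptions, not taken from the paper.

```python
import numpy as np

def anchor_mean_subtract(anchor_feats: np.ndarray,
                         utterance_feats: np.ndarray) -> np.ndarray:
    """Sketch of anchored mean subtraction (assumed interpretation).

    anchor_feats:    (T_anchor, D) feature frames of the anchor word
                     (e.g. the wake word spoken by the desired talker)
    utterance_feats: (T, D) feature frames of the following request

    Returns utterance features with the per-dimension anchor mean removed,
    so that the representation is centered on the desired speaker.
    """
    # Per-dimension mean over the anchor-word frames only.
    anchor_mean = anchor_feats.mean(axis=0)
    # Subtract that mean from every subsequent frame.
    return utterance_feats - anchor_mean
```

The intent, as stated in the abstract, is that the anchor segment serves as a reference for the desired talker, so normalizing by its mean (rather than a global or utterance-level mean) biases downstream classification toward that speaker.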


DOI: 10.21437/Interspeech.2016-1346

Cite as

Maas, R., Parthasarathi, S.H.K., King, B., Huang, R., Hoffmeister, B. (2016) Anchored Speech Detection. Proc. Interspeech 2016, 2963-2967.

BibTeX
@inproceedings{Maas+2016,
  author={Roland Maas and Sree Hari Krishnan Parthasarathi and Brian King and Ruitong Huang and Björn Hoffmeister},
  title={Anchored Speech Detection},
  year={2016},
  booktitle={Interspeech 2016},
  doi={10.21437/Interspeech.2016-1346},
  url={http://dx.doi.org/10.21437/Interspeech.2016-1346},
  pages={2963--2967}
}