Robust Speech Recognition via Anchor Word Representations

Brian King, I-Fan Chen, Yonatan Vaizman, Yuzong Liu, Roland Maas, Sree Hari Krishnan Parthasarathi, Björn Hoffmeister


A challenge for speech recognition on voice-controlled household devices, such as the Amazon Echo or Google Home, is robustness against interfering background speech. In this far-field speech recognition setting, another person or a media device in proximity can produce background speech that interferes with the device-directed speech. We expand on our previous work on device-directed speech detection in the far-field setting and introduce two approaches for robust acoustic modeling. Both methods are based on the idea of using an anchor word taken from the device-directed speech. Our first method employs a simple yet effective normalization of the acoustic features by subtracting the mean computed over the anchor word. The second method uses an encoder network that projects the anchor word onto a fixed-size embedding, which serves as an additional input to the acoustic model; the encoder network and acoustic model are jointly trained. Results on an in-house dataset show that, in the presence of background speech, the proposed approaches achieve up to a 35% relative word error rate reduction.
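The first method (anchor-word mean normalization) can be illustrated with a minimal sketch. This is not the authors' implementation; the function name, feature layout (one list of floats per frame), and the idea of indexing the anchor word by a frame span are assumptions made for illustration.

```python
def anchor_mean_normalize(features, anchor_start, anchor_end):
    """Subtract the per-dimension mean computed over the anchor-word frames
    [anchor_start, anchor_end) from every frame of the utterance.

    features: list of frames, each a list of floats (e.g. log-filterbank).
    Returns the normalized frames. Illustrative sketch only; the actual
    feature type and anchor-word localization are not specified here.
    """
    anchor = features[anchor_start:anchor_end]
    dim = len(features[0])
    # Per-dimension mean over the anchor-word frames only.
    mean = [sum(f[d] for f in anchor) / len(anchor) for d in range(dim)]
    # Apply the same offset to every frame of the utterance.
    return [[f[d] - mean[d] for d in range(dim)] for f in features]

# Toy example: 4 frames of 2-dim features; anchor word spans frames 0-1.
feats = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
norm = anchor_mean_normalize(feats, 0, 2)
```

The intuition is that the anchor word (e.g. the wake word) is spoken by the target user, so its mean captures speaker and channel characteristics of the device-directed speech; subtracting it biases the features toward that speaker.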


DOI: 10.21437/Interspeech.2017-1570

Cite as: King, B., Chen, I., Vaizman, Y., Liu, Y., Maas, R., Parthasarathi, S.H.K., Hoffmeister, B. (2017) Robust Speech Recognition via Anchor Word Representations. Proc. Interspeech 2017, 2471-2475, DOI: 10.21437/Interspeech.2017-1570.


@inproceedings{King2017,
  author={Brian King and I-Fan Chen and Yonatan Vaizman and Yuzong Liu and Roland Maas and Sree Hari Krishnan Parthasarathi and Björn Hoffmeister},
  title={Robust Speech Recognition via Anchor Word Representations},
  year=2017,
  booktitle={Proc. Interspeech 2017},
  pages={2471--2475},
  doi={10.21437/Interspeech.2017-1570},
  url={http://dx.doi.org/10.21437/Interspeech.2017-1570}
}