Noise robustness remains a challenging problem in on-device keyword spotting. Multi-microphone algorithms such as beamforming improve accuracy, but they inevitably increase computational complexity and tend to require more memory. In this paper, we propose a new neural-network-based architecture that takes multiple microphone signals as inputs. It achieves better accuracy with only a minimal increase in model size. Compared with a single-channel baseline that runs in parallel on each channel, the proposed architecture reduces the false reject (FR) rate by 36.3% and 46.4% relative on dual-microphone clean and noisy test sets, respectively, at a fixed false accept rate.
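The reductions above are relative, not absolute: they are fractions of the baseline FR rate, measured at the same false accept rate. The following minimal Python sketch illustrates the arithmetic only. The function names and the baseline FR rate of 10% are hypothetical and chosen for illustration; the abstract does not report absolute FR rates.

```python
def relative_reduction(baseline_rate: float, new_rate: float) -> float:
    """Return the relative reduction of new_rate with respect to baseline_rate."""
    if baseline_rate <= 0:
        raise ValueError("baseline_rate must be positive")
    return (baseline_rate - new_rate) / baseline_rate


def rate_after_reduction(baseline_rate: float, reduction: float) -> float:
    """Return the rate implied by applying a relative reduction to a baseline."""
    return baseline_rate * (1.0 - reduction)


# Hypothetical baseline FR rate (NOT from the paper), used only to show the math.
baseline_fr = 0.10

# Relative reductions quoted in the abstract for the two test sets.
for test_set, reduction in (("clean", 0.363), ("noisy", 0.464)):
    implied_fr = rate_after_reduction(baseline_fr, reduction)
    print(f"{test_set}: baseline FR {baseline_fr:.4f} -> implied FR {implied_fr:.4f}")
```

Under this assumed baseline, a 36.3% relative reduction lowers the FR rate by 3.63 percentage points, not by 36.3 percentage points.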
Cite as: Wu, J., Huang, Y., Park, H.-J., Subrahmanya, N., Violette, P. (2020) Small Footprint Multi-channel Keyword Spotting. Proc. The Speaker and Language Recognition Workshop (Odyssey 2020), 391-395, doi: 10.21437/Odyssey.2020-55
@inproceedings{wu20_odyssey,
  author={Jilong Wu and Yiteng Huang and Hyun-Jin Park and Niranjan Subrahmanya and Patrick Violette},
  title={{Small Footprint Multi-channel Keyword Spotting}},
  year=2020,
  booktitle={Proc. The Speaker and Language Recognition Workshop (Odyssey 2020)},
  pages={391--395},
  doi={10.21437/Odyssey.2020-55}
}