This work compares the performance of deep locally-connected networks (LCNs) and convolutional neural networks (CNNs) for text-dependent speaker recognition. Because both topologies exploit the local time-frequency correlations of the speech signal, they model it more efficiently than the fully-connected deep neural network (DNN) used in previous work, requiring only a fraction of its parameters. We show that both an LCN and a CNN can reduce the total model footprint to 30% of the size of the baseline fully-connected DNN, with minimal impact on performance or latency. In addition, at a matched parameter count, the LCN improves speaker verification performance, as measured by equal error rate (EER), by 8% relative over the baseline without increasing model size or computation. Similarly, a CNN improves EER by 10% relative over the baseline at the same model size, but with increased computation.
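The parameter savings the abstract describes can be sketched with a toy count. This is a minimal illustration under assumed layer sizes (a 40×40 time-frequency patch, 8×8 local receptive fields tiled without overlap, 64 filters), not the paper's actual configuration: a fully-connected layer learns one weight per input-output pair, a locally-connected layer learns a distinct small filter at each position, and a convolutional layer shares one filter across all positions.

```python
# Hypothetical parameter counts for three layer types producing the same
# number of outputs; sizes below are illustrative assumptions, not taken
# from the paper.

def fc_params(n_in, n_out):
    # fully connected: every input connects to every output, plus biases
    return n_in * n_out + n_out

def lc_params(n_positions, filter_size, n_filters):
    # locally connected: a distinct filter (and bias) per position and filter
    return n_positions * n_filters * (filter_size + 1)

def conv_params(filter_size, n_filters):
    # convolutional: a single filter (and bias) shared across all positions
    return n_filters * (filter_size + 1)

if __name__ == "__main__":
    # 40x40 input patch -> 1600 units; tiling 8x8 filters with stride 8
    # gives 5x5 = 25 positions; with 64 filters each layer emits 1600 outputs
    print("FC  :", fc_params(1600, 1600))   # 2,561,600 parameters
    print("LCN :", lc_params(25, 64, 64))   # 104,000 parameters
    print("CNN :", conv_params(64, 64))     # 4,160 parameters
```

For the same output dimensionality, the locally-connected layer needs a small fraction of the fully-connected layer's parameters, and weight sharing shrinks the convolutional layer further still, which is the footprint trade-off the paper measures against EER.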
Cite as: Chen, Y.-h., Lopez-Moreno, I., Sainath, T.N., Visontai, M., Alvarez, R., Parada, C. (2015) Locally-connected and convolutional neural networks for small footprint speaker recognition. Proc. Interspeech 2015, 1136-1140, doi: 10.21437/Interspeech.2015-297
@inproceedings{chen15h_interspeech, author={Yu-hsin Chen and Ignacio Lopez-Moreno and Tara N. Sainath and Mirkó Visontai and Raziel Alvarez and Carolina Parada}, title={{Locally-connected and convolutional neural networks for small footprint speaker recognition}}, year=2015, booktitle={Proc. Interspeech 2015}, pages={1136--1140}, doi={10.21437/Interspeech.2015-297} }