ISCA Archive Interspeech 2021

Spine2Net: SpineNet with Res2Net and Time-Squeeze-and-Excitation Blocks for Speaker Recognition

Magdalena Rybicka, Jesús Villalba, Piotr Żelasko, Najim Dehak, Konrad Kowalczyk

Modeling speaker embeddings using deep neural networks is currently the state of the art in speaker recognition. Recently, ResNet-based structures have gained broader interest, slowly becoming the baseline along with the deep-rooted Time Delay Neural Network based models. However, the scale-decreased design of the ResNet models may not preserve all of the speaker information. In this paper, we investigate the SpineNet structure with a scale-permuted design to tackle this problem, in which the feature size either increases or decreases depending on the processing stage in the network. Apart from the presented adjustments of the SpineNet model for the speaker recognition task, we also incorporate popular modules dedicated to residual-like structures, namely the Res2Net and Squeeze-and-Excitation blocks, and modify them to work effectively in the presented neural network architectures. The final proposed model, i.e., the SpineNet architecture with Res2Net and Time-Squeeze-and-Excitation blocks, achieves Equal Error Rates of 0.99% and 0.92% for the Extended and Original trial lists, respectively, of the well-known VoxCeleb1 dataset.


doi: 10.21437/Interspeech.2021-1163

Cite as: Rybicka, M., Villalba, J., Żelasko, P., Dehak, N., Kowalczyk, K. (2021) Spine2Net: SpineNet with Res2Net and Time-Squeeze-and-Excitation Blocks for Speaker Recognition. Proc. Interspeech 2021, 496-500, doi: 10.21437/Interspeech.2021-1163

@inproceedings{rybicka21_interspeech,
  author={Magdalena Rybicka and Jesús Villalba and Piotr Żelasko and Najim Dehak and Konrad Kowalczyk},
  title={{Spine2Net: SpineNet with Res2Net and Time-Squeeze-and-Excitation Blocks for Speaker Recognition}},
  year=2021,
  booktitle={Proc. Interspeech 2021},
  pages={496--500},
  doi={10.21437/Interspeech.2021-1163}
}