Recently, extracting speaker embeddings directly from raw waveforms has drawn increasing attention in the field of speaker verification. In such systems, parametric real-valued filters in the first convolutional layer are learned to transform the waveform into a time-frequency representation. However, these methods focus only on the magnitude spectrum, and the poor interpretability of the learned filters limits performance. In this paper, we propose a complex-valued speaker embedding extractor, named ICSpk, with higher interpretability and fewer parameters. Specifically, to quantify the speaker-related frequency response of the waveform, we first modify the original short-time Fourier transform filters into a family of complex exponential filters, named interpretable complex (IC) filters. Each IC filter is constrained to be a complex exponential parameterized by its frequency. Then, a deep complex-valued speaker embedding extractor is designed to operate on the complex-valued output of the IC filters. The proposed ICSpk is evaluated on the VoxCeleb and CNCeleb databases. Experimental results demonstrate that the IC filter-based system exhibits a significant improvement over complex spectrogram-based systems. Furthermore, the proposed ICSpk outperforms existing raw waveform-based systems by a large margin.
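To make the idea of IC filters concrete, the following is a minimal NumPy sketch of a filter bank in which each filter is a windowed complex exponential parameterized only by a center frequency, applied to framed raw audio to produce a complex-valued, spectrogram-like output. The function name, window choice, hop/window lengths, and linear frequency initialization are illustrative assumptions, not the authors' implementation (in the paper the frequencies would be learnable parameters of the first layer).

    import numpy as np

    def ic_filterbank(waveform, sample_rate=16000, num_filters=64,
                      win_length=400, hop_length=160, center_freqs=None):
        """Sketch of an interpretable complex (IC) filter bank.

        Each filter is a windowed complex exponential exp(-j*2*pi*f_k*n/fs),
        parameterized by its center frequency f_k, and is applied to the raw
        waveform by framing and projection -- conceptually a frequency-
        parameterized replacement for the fixed STFT basis.
        """
        if center_freqs is None:
            # Hypothetical initialization: linearly spaced up to Nyquist.
            center_freqs = np.linspace(0.0, sample_rate / 2, num_filters)

        n = np.arange(win_length)
        window = np.hamming(win_length)
        # One complex exponential filter per center frequency.
        filters = window * np.exp(
            -2j * np.pi * np.outer(center_freqs, n) / sample_rate)

        # Frame the waveform and project each frame onto the filters.
        num_frames = 1 + (len(waveform) - win_length) // hop_length
        frames = np.stack([
            waveform[i * hop_length: i * hop_length + win_length]
            for i in range(num_frames)])
        # Complex-valued output of shape (num_frames, num_filters),
        # which a complex-valued network could consume downstream.
        return frames @ filters.T

In a trainable version, only the center frequencies would be optimized, which is what gives the filters their interpretability compared with unconstrained real-valued convolution kernels.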
Cite as: Peng, J., Qu, X., Wang, J., Gu, R., Xiao, J., Burget, L., Černocký, J. (2021) ICSpk: Interpretable Complex Speaker Embedding Extractor from Raw Waveform. Proc. Interspeech 2021, 511-515, doi: 10.21437/Interspeech.2021-2016
@inproceedings{peng21_interspeech,
  author={Junyi Peng and Xiaoyang Qu and Jianzong Wang and Rongzhi Gu and Jing Xiao and Lukáš Burget and Jan Černocký},
  title={{ICSpk: Interpretable Complex Speaker Embedding Extractor from Raw Waveform}},
  year={2021},
  booktitle={Proc. Interspeech 2021},
  pages={511--515},
  doi={10.21437/Interspeech.2021-2016}
}