RawNet: Advanced End-to-End Deep Neural Network Using Raw Waveforms for Text-Independent Speaker Verification

Jee-weon Jung, Hee-Soo Heo, Ju-ho Kim, Hye-jin Shim, Ha-Jin Yu


Recently, direct modeling of raw waveforms using deep neural networks has been widely studied for a number of tasks in audio domains. In speaker verification, however, the use of raw waveforms is still at a preliminary stage and requires further investigation. In this study, we explore end-to-end deep neural networks that take raw waveforms as input, and we improve several aspects: front-end speaker embedding extraction (including model architecture, pre-training scheme, and additional objective functions) and back-end classification. Adjusting the model architecture together with a pre-training scheme makes it possible to extract speaker embeddings and gives a significant improvement in performance. Additional objective functions simplify the process of extracting speaker embeddings by merging the conventional two-phase process: extraction of utterance-level features such as i-vectors or x-vectors, followed by a feature enhancement phase, e.g., linear discriminant analysis. Effective back-end classification models that suit the proposed speaker embedding are also explored. We propose an end-to-end system that comprises two deep neural networks, one for front-end utterance-level speaker embedding extraction and the other for back-end classification. Experiments conducted on the VoxCeleb1 dataset demonstrate that the proposed model achieves state-of-the-art performance among systems without data augmentation. The proposed system is also comparable to the state-of-the-art x-vector system that adopts heavy data augmentation.
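The two-stage structure described in the abstract (a front-end that maps a raw waveform to one utterance-level embedding, and a back-end that decides whether two embeddings belong to the same speaker) can be illustrated with a toy sketch. This is not the RawNet architecture: the front-end here is an untrained random projection with mean pooling, and the back-end is plain cosine scoring standing in for the paper's neural back-end classifier. All names and values (`FRAME_LEN`, `EMB_DIM`, `make_frontend`, `cosine_score`, `verify`, the threshold) are hypothetical and chosen only for illustration.

```python
import math
import random

FRAME_LEN = 160  # hypothetical frame length in samples (assumption)
EMB_DIM = 8      # hypothetical embedding dimensionality (assumption)


def make_frontend(seed=0):
    """Build a toy front-end: per-frame random projection + ReLU, then mean pooling.

    A stand-in for a trained network; the weights are random, not learned.
    """
    rng = random.Random(seed)
    weights = [[rng.gauss(0.0, 1.0) for _ in range(FRAME_LEN)]
               for _ in range(EMB_DIM)]

    def embed(waveform):
        # Split the raw waveform into non-overlapping frames; drop the remainder.
        n_frames = len(waveform) // FRAME_LEN
        if n_frames == 0:
            raise ValueError("waveform is shorter than one frame")
        pooled = [0.0] * EMB_DIM
        for f in range(n_frames):
            frame = waveform[f * FRAME_LEN:(f + 1) * FRAME_LEN]
            for d in range(EMB_DIM):
                activation = sum(w * x for w, x in zip(weights[d], frame))
                pooled[d] += max(activation, 0.0)  # ReLU
        # Mean pooling over frames yields a single utterance-level vector.
        return [v / n_frames for v in pooled]

    return embed


def cosine_score(a, b):
    """Cosine similarity between two embeddings (stand-in back-end)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def verify(embed, enrol_wave, test_wave, threshold=0.5):
    """Return (score, accepted) for an enrolment/test pair of raw waveforms."""
    score = cosine_score(embed(enrol_wave), embed(test_wave))
    return score, score >= threshold


# Usage: synthetic noise stands in for real speech samples.
_rng = random.Random(1)
wave = [_rng.uniform(-1.0, 1.0) for _ in range(10 * FRAME_LEN)]
embed = make_frontend()
score, accepted = verify(embed, wave, wave)
```

In a real system both stages would be trained networks, and the decision threshold would be tuned on held-out trials rather than fixed by hand.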


DOI: 10.21437/Interspeech.2019-1982

Cite as: Jung, J., Heo, H., Kim, J., Shim, H., Yu, H. (2019) RawNet: Advanced End-to-End Deep Neural Network Using Raw Waveforms for Text-Independent Speaker Verification. Proc. Interspeech 2019, 1268-1272, DOI: 10.21437/Interspeech.2019-1982.


@inproceedings{Jung2019,
  author={Jee-weon Jung and Hee-Soo Heo and Ju-ho Kim and Hye-jin Shim and Ha-Jin Yu},
  title={{RawNet: Advanced End-to-End Deep Neural Network Using Raw Waveforms for Text-Independent Speaker Verification}},
  year=2019,
  booktitle={Proc. Interspeech 2019},
  pages={1268--1272},
  doi={10.21437/Interspeech.2019-1982},
  url={http://dx.doi.org/10.21437/Interspeech.2019-1982}
}