End-to-End Speaker Identification in Noisy and Reverberant Environments Using Raw Waveform Convolutional Neural Networks

Daniele Salvati, Carlo Drioli, Gian Luca Foresti


Convolutional neural network (CNN) models are being investigated extensively in the field of speech and speaker recognition, and are rapidly gaining appreciation due to their robust performance and effective training strategies. Recently, they have also been providing interesting results in end-to-end configurations that classify raw waveforms directly, with the drawback, however, of being more sensitive to the amount of training data. We present a raw waveform (RW) end-to-end computational scheme for speaker identification based on CNNs with noise and reverberation data augmentation (DA). The CNN is designed for a frame-to-frame analysis to handle variable-length signals. We analyze the identification performance with simulated experiments in noisy and reverberant conditions, comparing the proposed RW-CNN with mel-frequency cepstral coefficient (MFCC) features. The results show that the method is robust to adverse conditions. The RW-CNN outperforms the MFCC-CNN in noisy conditions, and the two have similar performance in reverberant environments.
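Below is a minimal sketch (in PyTorch, which the paper does not specify) of the two ingredients the abstract describes: a 1-D CNN operating directly on fixed-length raw-waveform frames, and additive-noise data augmentation at a chosen signal-to-noise ratio. The layer sizes, frame length, and function names are illustrative assumptions, not the authors' architecture.

```python
import torch
import torch.nn as nn


class RawWaveformCNN(nn.Module):
    """Illustrative 1-D CNN mapping a raw-waveform frame to speaker posteriors.

    The layer sizes and kernel widths are assumptions for the sketch, not the
    configuration used in the paper.
    """

    def __init__(self, num_speakers: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=64, stride=4), nn.ReLU(),
            nn.MaxPool1d(4),
            nn.Conv1d(32, 64, kernel_size=16, stride=2), nn.ReLU(),
            nn.MaxPool1d(4),
            nn.AdaptiveAvgPool1d(1),           # pool over time -> fixed-size embedding
        )
        self.classifier = nn.Linear(64, num_speakers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frame_len) raw-waveform frames
        h = self.features(x.unsqueeze(1))       # add channel dimension
        return self.classifier(h.squeeze(-1))   # (batch, num_speakers)


def add_noise(frame: torch.Tensor, snr_db: float) -> torch.Tensor:
    """Additive white-noise augmentation at a target SNR in dB (hypothetical helper)."""
    noise = torch.randn_like(frame)
    sig_pow = frame.pow(2).mean()
    noise_pow = noise.pow(2).mean()
    scale = torch.sqrt(sig_pow / (noise_pow * 10 ** (snr_db / 10)))
    return frame + scale * noise


if __name__ == "__main__":
    model = RawWaveformCNN(num_speakers=10)
    frames = torch.randn(8, 4000)               # e.g. 0.25 s frames at 16 kHz
    logits = model(add_noise(frames, snr_db=10.0))
    print(logits.shape)                          # torch.Size([8, 10])
```

In a frame-to-frame scheme of this kind, a variable-length utterance would be split into fixed-length frames, each frame scored by the network, and the frame-level posteriors aggregated (e.g., averaged) for the utterance-level speaker decision; the aggregation rule here is an assumption, as the abstract does not detail it.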


 DOI: 10.21437/Interspeech.2019-2403

Cite as: Salvati, D., Drioli, C., Foresti, G.L. (2019) End-to-End Speaker Identification in Noisy and Reverberant Environments Using Raw Waveform Convolutional Neural Networks. Proc. Interspeech 2019, 4335-4339, DOI: 10.21437/Interspeech.2019-2403.


@inproceedings{Salvati2019,
  author={Daniele Salvati and Carlo Drioli and Gian Luca Foresti},
  title={{End-to-End Speaker Identification in Noisy and Reverberant Environments Using Raw Waveform Convolutional Neural Networks}},
  year=2019,
  booktitle={Proc. Interspeech 2019},
  pages={4335--4339},
  doi={10.21437/Interspeech.2019-2403},
  url={http://dx.doi.org/10.21437/Interspeech.2019-2403}
}