Unsupervised Raw Waveform Representation Learning for ASR

Purvi Agrawal, Sriram Ganapathy


In this paper, we propose a deep representation learning approach using the raw speech waveform in an unsupervised learning paradigm. The first layer of the proposed deep model performs acoustic filtering while the subsequent layer performs modulation filtering. The acoustic filterbank is implemented using cosine-modulated Gaussian filters whose parameters are learned. The modulation filtering is performed on the log-transformed outputs of the first layer, and this is achieved using a skip-connection-based architecture. The outputs from this two-layer filtering are fed to a variational autoencoder (VAE) model. All the model parameters, including those of the filtering layers, are learned using the VAE cost function. We employ the learned representations (second-layer outputs) in a speech recognition task. Experiments are conducted on the Aurora-4 (additive noise with channel artifact) and CHiME-3 (additive noise with reverberation) databases. In these experiments, the learned representations from the proposed framework provide significant improvements in ASR results over the baseline filterbank features and other robust front-ends (average relative improvements of 16% and 6% in word error rate over baseline features on clean and multi-condition training, respectively, on the Aurora-4 dataset, and 21% over the baseline features on the CHiME-3 database).
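
As an illustration of the first (acoustic filtering) layer described above, the following is a minimal NumPy sketch of a cosine-modulated Gaussian filterbank applied to a raw waveform and followed by a log transform. It is not the authors' implementation: the function names, the parametrization by center frequency and Gaussian width, the kernel length, the 16 kHz sample rate, and the squaring before the log are all assumptions made for this example. In the paper the filter parameters are learned through the VAE cost function, whereas here they are simply fixed by hand.

```python
import numpy as np


def cos_gauss_filterbank(center_hz, sigma_s, kernel_len=129, sample_rate=16000):
    """Build cosine-modulated Gaussian kernels, shape (num_filters, kernel_len).

    center_hz: center frequency of each filter in Hz (hypothetical parameter).
    sigma_s:   Gaussian envelope width of each filter in seconds (hypothetical).
    """
    center_hz = np.asarray(center_hz, dtype=float)[:, None]
    sigma_s = np.asarray(sigma_s, dtype=float)[:, None]
    # Time axis in seconds, centered so that t = 0 falls on the middle tap.
    t = (np.arange(kernel_len) - (kernel_len - 1) / 2) / sample_rate
    t = t[None, :]
    envelope = np.exp(-0.5 * (t / sigma_s) ** 2)   # Gaussian window
    carrier = np.cos(2.0 * np.pi * center_hz * t)  # cosine modulation
    return carrier * envelope


def acoustic_layer(waveform, kernels, eps=1e-6):
    """Filter the waveform with every kernel, then log-compress.

    Squaring before the log and the eps floor are assumptions of this sketch.
    """
    outputs = np.stack([np.convolve(waveform, k, mode="same") for k in kernels])
    return np.log(outputs ** 2 + eps)


# Two hand-picked filters; a learned filterbank would adapt these values.
kernels = cos_gauss_filterbank(center_hz=[500.0, 1000.0], sigma_s=[0.002, 0.001])
rng = np.random.default_rng(0)
waveform = rng.standard_normal(1600)  # 0.1 s of noise at 16 kHz
features = acoustic_layer(waveform, kernels)
```

Each kernel is a band-pass filter whose passband is centered on the cosine frequency and whose bandwidth is inversely proportional to the Gaussian width, which is why only a small number of parameters per filter needs to be learned.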


DOI: 10.21437/Interspeech.2019-2652

Cite as: Agrawal, P., Ganapathy, S. (2019) Unsupervised Raw Waveform Representation Learning for ASR. Proc. Interspeech 2019, 3451-3455, DOI: 10.21437/Interspeech.2019-2652.


@inproceedings{Agrawal2019,
  author={Purvi Agrawal and Sriram Ganapathy},
  title={{Unsupervised Raw Waveform Representation Learning for ASR}},
  year=2019,
  booktitle={Proc. Interspeech 2019},
  pages={3451--3455},
  doi={10.21437/Interspeech.2019-2652},
  url={http://dx.doi.org/10.21437/Interspeech.2019-2652}
}