Comparison of Unsupervised Modulation Filter Learning Methods for ASR

Purvi Agrawal, Sriram Ganapathy

The widespread deployment of automatic speech recognition (ASR) systems in consumer-centric applications such as voice interaction and voice search demands noise robustness in such systems. One approach to this problem is to achieve the desired robustness in the speech representations used by the ASR system. Motivated by studies on robust human speech recognition, we analyse unsupervised data-driven temporal modulation filter learning for robust feature extraction. In this paper, we compare several unsupervised models for data-driven filter learning: the convolutional autoencoder (CAE), the generative adversarial network (GAN) and the convolutional restricted Boltzmann machine (CRBM). The unsupervised models are designed to learn a set of filters from long temporal trajectories of speech sub-band energy. The filters learnt from these models are used for modulation filtering of the input spectrogram before ASR training. The ASR experiments are performed on the Wall Street Journal (WSJ) Aurora-4 database with clean and multi-condition training setups. The experimental results obtained from the modulation-filtered representations show considerable robustness to noise, channel distortions and reverberant conditions compared to other feature extraction methods. Among the three approaches compared in this paper, the GAN approach provides the most consistent improvements in ASR accuracy across the different training scenarios.
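As a rough illustration of what modulation filtering of a spectrogram involves, the sketch below convolves each sub-band energy trajectory with a 1-D temporal kernel. It is a minimal example under stated assumptions, not the authors' implementation: the function name `modulation_filter`, the random toy spectrogram, and the 5-tap moving-average kernel are all hypothetical stand-ins. In the paper the kernels would be learnt by the CAE, GAN or CRBM rather than fixed by hand.

```python
import numpy as np

def modulation_filter(spectrogram, kernel):
    """Filter each sub-band trajectory (one row per band) along time.

    spectrogram: array of shape (num_bands, num_frames)
    kernel:      1-D temporal filter (hypothetical placeholder here)
    """
    # mode="same" keeps the number of frames unchanged for each band
    return np.stack(
        [np.convolve(band, kernel, mode="same") for band in spectrogram]
    )

# Toy input: 4 sub-bands x 50 frames of random values standing in for
# sub-band log energies (assumed data, not from the paper)
rng = np.random.default_rng(0)
spec = rng.standard_normal((4, 50))

# Placeholder kernel: a 5-tap moving average, i.e. a simple low-pass
# modulation filter; a learnt filter would replace this
kernel = np.ones(5) / 5.0

filtered = modulation_filter(spec, kernel)
```

The filtered array has the same shape as the input, so it could be passed to a downstream feature pipeline in place of the original spectrogram.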

DOI: 10.21437/Interspeech.2018-1972

Cite as: Agrawal, P., Ganapathy, S. (2018) Comparison of Unsupervised Modulation Filter Learning Methods for ASR. Proc. Interspeech 2018, 2908-2912, DOI: 10.21437/Interspeech.2018-1972.

  author={Purvi Agrawal and Sriram Ganapathy},
  title={Comparison of Unsupervised Modulation Filter Learning Methods for ASR},
  booktitle={Proc. Interspeech 2018},