Acoustic Modeling from Frequency Domain Representations of Speech

Pegah Ghahremani, Hossein Hadian, Hang Lv, Daniel Povey, Sanjeev Khudanpur


In recent years, several studies have proposed DNN-based feature extraction and joint acoustic-model training and feature learning from the raw waveform for large-vocabulary speech recognition. However, conventional pre-processed features such as MFCC and PLP are still preferred in state-of-the-art speech recognition systems because they are perceived to be more robust. Moreover, the raw-waveform methods, most of which operate on the time-domain signal, do not significantly outperform the conventional methods. In this paper, we propose a frequency-domain feature-learning layer that allows acoustic model training directly from the waveform. The main distinctions from previous work are a new normalization block and a short-range constraint on the filter weights. The proposed setup achieves consistent performance improvements over baseline MFCC and log-Mel features, as well as over other proposed time- and frequency-domain setups, on different LVCSR tasks. Finally, based on the filters learned in our feature-learning layer, we propose a new set of analytic filters using polynomial approximation, which significantly outperforms log-Mel filters while being equally fast.
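For orientation, the log-Mel baseline named in the abstract can be sketched as a fixed filterbank applied to the short-time power spectrum. This is a minimal illustration, not the paper's layer: the function names, the 25 ms / 10 ms framing at 16 kHz, the Hamming window, and the 40 triangular filters are all assumptions chosen for the example. A frequency-domain feature-learning layer would, roughly speaking, replace the fixed `filters` matrix with trainable weights; the normalization block and the short-range constraint are not specified in the abstract and are therefore not shown.

```python
import numpy as np


def hz_to_mel(f):
    # Common mel-scale formula (one of several conventions in use).
    return 2595.0 * np.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(num_filters, fft_size, sample_rate):
    """Fixed triangular filters, shape (num_filters, fft_size // 2 + 1)."""
    num_bins = fft_size // 2 + 1
    bin_freqs = np.linspace(0.0, sample_rate / 2.0, num_bins)
    # num_filters + 2 edge frequencies, equally spaced on the mel scale.
    edges = mel_to_hz(
        np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), num_filters + 2)
    )
    filters = np.zeros((num_filters, num_bins))
    for i in range(num_filters):
        left, center, right = edges[i], edges[i + 1], edges[i + 2]
        rising = (bin_freqs - left) / (center - left)
        falling = (right - bin_freqs) / (right - center)
        # Triangle: rises from left to center, falls from center to right.
        filters[i] = np.maximum(0.0, np.minimum(rising, falling))
    return filters


def log_filterbank_features(waveform, filters, frame_len=400, frame_shift=160,
                            fft_size=512):
    """Frame the waveform, take the power spectrum, apply filters, take the log."""
    num_frames = 1 + (len(waveform) - frame_len) // frame_shift
    window = np.hamming(frame_len)
    frames = np.stack([
        waveform[t * frame_shift: t * frame_shift + frame_len] * window
        for t in range(num_frames)
    ])
    power = np.abs(np.fft.rfft(frames, n=fft_size, axis=1)) ** 2
    # Small floor avoids log(0) on silent frames.
    return np.log(power @ filters.T + 1e-10)
```

With these assumed settings, one second of 16 kHz audio yields one 40-dimensional feature vector per 10 ms frame. In a learned variant, only the `filters` matrix would change; the framing and FFT steps stay the same.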


DOI: 10.21437/Interspeech.2018-1453

Cite as: Ghahremani, P., Hadian, H., Lv, H., Povey, D., Khudanpur, S. (2018) Acoustic Modeling from Frequency Domain Representations of Speech. Proc. Interspeech 2018, 1596-1600, DOI: 10.21437/Interspeech.2018-1453.


@inproceedings{Ghahremani2018,
  author={Pegah Ghahremani and Hossein Hadian and Hang Lv and Daniel Povey and Sanjeev Khudanpur},
  title={Acoustic Modeling from Frequency Domain Representations of Speech},
  year=2018,
  booktitle={Proc. Interspeech 2018},
  pages={1596--1600},
  doi={10.21437/Interspeech.2018-1453},
  url={http://dx.doi.org/10.21437/Interspeech.2018-1453}
}