Impact of Aliasing on Deep CNN-Based End-to-End Acoustic Models

Yuan Gong, Christian Poellabauer


A recent trend in audio and speech processing is to learn target labels directly from raw waveforms rather than from hand-crafted acoustic features. Previous work has shown that deep convolutional neural networks (CNNs) used as a front-end can learn effective representations from the raw waveform. However, due to the high dimensionality of raw audio waveforms, pooling layers are usually applied aggressively between temporal convolutional layers. In essence, these pooling layers perform an operation similar to signal downsampling, which may lead to temporal aliasing according to the Nyquist-Shannon sampling theorem. This paper explores, through a series of experiments, whether and how this aliasing effect impacts modern deep CNN-based models.
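The aliasing mechanism the abstract refers to can be illustrated with a minimal NumPy sketch (not from the paper): decimating a signal without a low-pass filter, as a strided pooling layer effectively does, folds frequencies above the new Nyquist limit down into the passband. Here a 7 kHz tone sampled at 16 kHz is downsampled by a hypothetical stride of 4, so the new Nyquist limit is 2 kHz and the tone aliases to |7000 − 2·4000| = 1000 Hz.

```python
import numpy as np

# 1 second of a 7 kHz tone sampled at 16 kHz.
fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 7000 * t)

# Decimate by 4 with no anti-aliasing filter, analogous to a
# stride-4 pooling layer. Effective sampling rate drops to 4 kHz,
# so the Nyquist limit becomes 2 kHz.
stride = 4
y = x[::stride]
fs_new = fs // stride

# Locate the dominant frequency of the decimated signal.
spectrum = np.abs(np.fft.rfft(y))
freqs = np.fft.rfftfreq(len(y), d=1 / fs_new)
peak = freqs[np.argmax(spectrum)]
print(peak)  # the 7 kHz tone now appears at 1000 Hz
```

The stride of 4 and the tone frequency are illustrative choices; the same folding occurs for any content above half the post-pooling sampling rate.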


DOI: 10.21437/Interspeech.2018-1371

Cite as: Gong, Y., Poellabauer, C. (2018) Impact of Aliasing on Deep CNN-Based End-to-End Acoustic Models. Proc. Interspeech 2018, 2698-2702, DOI: 10.21437/Interspeech.2018-1371.


@inproceedings{Gong2018,
  author={Yuan Gong and Christian Poellabauer},
  title={Impact of Aliasing on Deep CNN-Based End-to-End Acoustic Models},
  year=2018,
  booktitle={Proc. Interspeech 2018},
  pages={2698--2702},
  doi={10.21437/Interspeech.2018-1371},
  url={http://dx.doi.org/10.21437/Interspeech.2018-1371}
}