Acoustic Modelling from the Signal Domain Using CNNs

Pegah Ghahremani, Vimal Manohar, Daniel Povey, Sanjeev Khudanpur


Most speech recognition systems use spectral features based on fixed filters, such as MFCC and PLP. In this paper, we show that it is possible to achieve state-of-the-art results by making the feature extractor a part of the network and jointly optimizing it with the rest of the network. The basic approach is to start with a convolutional layer that operates on the signal (say, with a step size of 1.25 milliseconds), aggregate the filter outputs over a portion of the time axis using a network-in-network architecture, and then down-sample to one output every 10 milliseconds for use by the rest of the network. We find that, unlike some previous work on learned feature extractors, the objective function converges as fast as for a network based on traditional features.
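
The front end described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the 16 kHz sampling rate (so that 1.25 ms is 20 samples and 10 ms is 8 steps), the filter length, the non-overlapping 10 ms aggregation blocks, and the single affine-plus-ReLU micro-network standing in for the network-in-network stage are all assumptions made for illustration, and the name `signal_frontend` and its parameters are hypothetical.

```python
import numpy as np


def signal_frontend(signal, filters, nin_w, nin_b, step=20, steps_per_frame=8):
    """Toy raw-signal front end (illustrative only).

    signal:  (T,) raw waveform; 16 kHz sampling is assumed, so step=20
             samples corresponds to 1.25 ms.
    filters: (F, L) bank of F convolution filters of length L samples.
    nin_w:   (steps_per_frame, H) weights of a small micro-network that is
             shared across all filters (a stand-in for network-in-network).
    nin_b:   (H,) bias of that micro-network.
    Returns an array of shape (num_frames, F * H), one row per 10 ms.
    """
    num_filters, filt_len = filters.shape
    num_steps = (len(signal) - filt_len) // step + 1

    # Strided convolution: gather one window per 1.25 ms step, then apply
    # every filter to every window with a single matrix product.
    idx = step * np.arange(num_steps)[:, None] + np.arange(filt_len)[None, :]
    conv_out = signal[idx] @ filters.T            # (num_steps, F)

    # Group the filter outputs into non-overlapping 10 ms blocks.
    num_frames = num_steps // steps_per_frame
    blocks = conv_out[: num_frames * steps_per_frame]
    blocks = blocks.reshape(num_frames, steps_per_frame, num_filters)
    blocks = blocks.transpose(0, 2, 1)            # (num_frames, F, steps)

    # Aggregate over time within each block: the same affine map + ReLU is
    # applied to each filter's sequence of outputs.
    hidden = np.maximum(blocks @ nin_w + nin_b, 0.0)   # (num_frames, F, H)

    # Down-sampled output: one feature vector every 10 ms.
    return hidden.reshape(num_frames, -1)
```

In a trained system the filters and micro-network weights would be learned jointly with the rest of the acoustic model; here they are simply arguments, so the sketch only shows how the time resolution goes from 1.25 ms steps to 10 ms frames.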

Because we found that iVector adaptation is less effective in this framework, we also experiment with a different adaptation method that is part of the network, where activation statistics over a medium time span (around a second) are computed at intermediate layers. We find that the resulting ‘direct-from-signal’ network is competitive with our state-of-the-art networks based on conventional features with iVector adaptation.
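
One way to picture the in-network adaptation is a layer that appends moving-window statistics to each frame's activations. The sketch below is a hedged illustration, not the paper's method: the abstract does not say which statistics are used or how they are fed back, so the choice of mean and standard deviation, the centered 100-frame window (about one second at 10 ms per frame), and the name `append_window_stats` are all assumptions.

```python
import numpy as np


def append_window_stats(acts, window=100):
    """Append per-dimension moving mean and std to each frame (illustrative).

    acts:   (num_frames, dim) activations of an intermediate layer, assumed
            to be at one frame per 10 ms, so window=100 spans about 1 s.
    Returns an array of shape (num_frames, 3 * dim): [acts, mean, std].
    """
    num_frames, dim = acts.shape
    half = window // 2

    # Prefix sums with a leading zero row, so that the sum over frames
    # [s, e) is csum[e] - csum[s].
    csum = np.vstack([np.zeros((1, dim)), np.cumsum(acts, axis=0)])
    csum_sq = np.vstack([np.zeros((1, dim)), np.cumsum(acts ** 2, axis=0)])

    # Centered window, truncated at the start and end of the utterance.
    frames = np.arange(num_frames)
    starts = np.maximum(frames - half, 0)
    ends = np.minimum(frames + half + 1, num_frames)
    counts = (ends - starts)[:, None]

    mean = (csum[ends] - csum[starts]) / counts
    var = (csum_sq[ends] - csum_sq[starts]) / counts - mean ** 2
    std = np.sqrt(np.maximum(var, 0.0))  # clamp tiny negative round-off

    return np.hstack([acts, mean, std])
```

Because the statistics are computed from the network's own activations, later layers can condition on slowly varying speaker or channel characteristics without a separately extracted iVector, which is the role this kind of layer would play in the setup the abstract describes.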


DOI: 10.21437/Interspeech.2016-1495

Cite as

Ghahremani, P., Manohar, V., Povey, D., Khudanpur, S. (2016) Acoustic Modelling from the Signal Domain Using CNNs. Proc. Interspeech 2016, 3434-3438.

Bibtex
@inproceedings{Ghahremani+2016,
author={Pegah Ghahremani and Vimal Manohar and Daniel Povey and Sanjeev Khudanpur},
title={Acoustic Modelling from the Signal Domain Using CNNs},
year=2016,
booktitle={Interspeech 2016},
doi={10.21437/Interspeech.2016-1495},
url={http://dx.doi.org/10.21437/Interspeech.2016-1495},
pages={3434--3438}
}