In this paper we show how non-linear features of spectrogram frames can be discovered using a novel autoencoder. The autoencoder uses a neural network encoder that predicts how a set of prototypes, called templates, need to be transformed to reconstruct the data, and a decoder that is a function that applies these transformations to the prototypes and reconstructs the input. We demonstrate this method on spectrograms from the TIMIT database. The features are used in a Deep Neural Network - Hidden Markov Model (DNN-HMM) hybrid system for automatic speech recognition. On the TIMIT monophone recognition task we were able to achieve gains of 0.5% over Mel log spectra by augmenting the traditional spectra with the predicted transformation parameters. Further, using the recently discovered "dropout" training, we were able to achieve a phone error rate (PER) of 17.9% on the dev set and 19.5% on the test set, which, to our knowledge, is the best reported number on this task using a hybrid system.
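The encoder/decoder split described above can be illustrated with a minimal sketch. The abstract does not specify the family of transformations, so everything concrete here is an assumption made for illustration: each template is a 1-D vector over frequency bins, each transformation is a (shift, scale, gain) triple, and the encoder is a hypothetical single-layer network. The names `transform_template`, `decode`, and `encode` are not from the paper.

```python
import numpy as np

def transform_template(template, shift, scale, gain):
    """Warp a 1-D template along its axis, then scale its amplitude.

    Assumption: the transformation is an affine warp of the bin axis
    (shift, scale) plus a gain. The paper's actual family may differ.
    """
    positions = np.arange(template.shape[0], dtype=float)
    # Output bin i reads the template at position (i - shift) / scale.
    source = (positions - shift) / scale
    warped = np.interp(source, positions, template, left=0.0, right=0.0)
    return gain * warped

def decode(templates, params):
    """Fixed decoder: transform each template and sum the results.

    templates: array of shape (K, n); params: array of shape (K, 3)
    holding one (shift, scale, gain) row per template.
    """
    frame = np.zeros(templates.shape[1])
    for template, (shift, scale, gain) in zip(templates, params):
        frame += transform_template(template, shift, scale, gain)
    return frame

def encode(frame, weights, bias):
    """Hypothetical one-layer encoder predicting transformation parameters.

    weights has shape (3 * K, n). The squashing below keeps shifts small
    and scales and gains positive; these ranges are illustrative choices.
    """
    raw = np.tanh(weights @ frame + bias).reshape(-1, 3)
    shift = 2.0 * raw[:, 0]
    scale = np.exp(0.2 * raw[:, 1])
    gain = np.exp(raw[:, 2])
    return np.stack([shift, scale, gain], axis=1)
```

In this sketch only the encoder weights and the templates would be learned; the decoder has no parameters of its own beyond the templates. The predicted parameters returned by `encode` play the role of the features that the abstract describes appending to the Mel log spectra.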
Bibliographic reference. Jaitly, Navdeep / Hinton, Geoffrey E. (2013): "Using an autoencoder with deformable templates to discover features for automated speech recognition", In INTERSPEECH-2013, 1737-1740.