14th Annual Conference of the International Speech Communication Association

Lyon, France
August 25-29, 2013

Using an Autoencoder with Deformable Templates to Discover Features for Automated Speech Recognition

Navdeep Jaitly, Geoffrey E. Hinton

University of Toronto, Canada

In this paper we show how we can discover non-linear features of frames of spectrograms using a novel autoencoder. The autoencoder uses a neural network encoder that predicts how a set of prototypes called templates needs to be transformed to reconstruct the data, and a decoder that applies these transformations to the templates and reconstructs the input. We demonstrate this method on spectrograms from the TIMIT database. The features are used in a Deep Neural Network - Hidden Markov Model (DNN-HMM) hybrid system for automatic speech recognition. On the TIMIT monophone recognition task we were able to achieve gains of 0.5% over Mel log spectra by augmenting the traditional spectra with the predicted transformation parameters. Further, using the recently discovered 'dropout' training, we were able to achieve a phone error rate (PER) of 17.9% on the dev set and 19.5% on the test set, which, to our knowledge, is the best reported number on this task using a hybrid system.
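
The encoder/decoder split described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the templates are 1-D vectors over frequency bins, the only transformations are a shift along the frequency axis and an amplitude gain, the encoder is a single tanh hidden layer, and the names `encode`, `decode`, `w_hidden`, and `w_out` are hypothetical. The abstract does not state which transformations or network architecture the authors actually use.

```python
import numpy as np


def encode(frame, w_hidden, w_out):
    """Hypothetical encoder: map a spectrogram frame to transformation parameters.

    Returns one shift and one gain per template (assumed parameterization).
    """
    hidden = np.tanh(w_hidden @ frame)
    params = w_out @ hidden
    # First half of the outputs are shifts, second half are gains.
    shifts, gains = np.split(params, 2)
    return shifts, gains


def decode(templates, shifts, gains):
    """Fixed decoder: shift each template along frequency, scale it, and sum.

    The decoder has no learned weights of its own here; it only applies
    the predicted transformations to the templates.
    """
    n_bins = templates.shape[1]
    bins = np.arange(n_bins, dtype=float)
    frame = np.zeros(n_bins)
    for template, shift, gain in zip(templates, shifts, gains):
        # Linear interpolation resamples the template at shifted positions;
        # positions that fall outside the template contribute zero.
        shifted = np.interp(bins - shift, bins, template, left=0.0, right=0.0)
        frame += gain * shifted
    return frame


# Illustrative use with random (untrained) weights and two templates.
rng = np.random.default_rng(0)
n_bins, n_hidden, n_templates = 4, 8, 2
templates = rng.random((n_templates, n_bins))
w_hidden = rng.normal(size=(n_hidden, n_bins))
w_out = rng.normal(size=(2 * n_templates, n_hidden))

frame = rng.random(n_bins)
shifts, gains = encode(frame, w_hidden, w_out)
reconstruction = decode(templates, shifts, gains)
```

In a setup like this, the predicted `shifts` and `gains` would play the role of the transformation parameters that the abstract describes appending to the Mel log spectra as extra features; training the encoder and templates to minimize reconstruction error is omitted from the sketch.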


Bibliographic reference.  Jaitly, Navdeep / Hinton, Geoffrey E. (2013): "Using an autoencoder with deformable templates to discover features for automated speech recognition", In INTERSPEECH-2013, 1737-1740.