13th Annual Conference of the International Speech Communication Association

Portland, OR, USA
September 9-13, 2012

Hermitian Based Hidden Activation Functions for Adaptation of Hybrid HMM/ANN Models

Sabato Marco Siniscalchi (1,3), Jinyu Li (2), Chin-Hui Lee (3)

(1) Faculty of Telematics Engineering, Kore University of Enna, Enna, Italy
(2) Microsoft Corporation, Redmond, WA, USA
(3) School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA, USA

This work addresses speaker adaptation for artificial neural networks (ANNs) implemented as feed-forward multi-layer perceptrons (MLPs) in the context of large vocabulary continuous speech recognition (LVCSR). The most successful speaker adaptation techniques for MLPs augment the neural architecture with a linear transformation network connected to either the input or the output layer. The weights of this additional linear layer are learned during the adaptation phase, while all other weights are kept frozen to avoid over-fitting. As a result, the structures of the speaker-dependent (SD) and speaker-independent (SI) architectures differ, and the number of adaptation parameters depends on the dimension of either the input or the output layer. We propose a more flexible neural architecture for speaker adaptation that overcomes these limits. The flexibility comes from hidden activation functions that can be learned directly from the adaptation data, an adaptive capability achieved through the use of orthonormal Hermite polynomials. Experimental evidence gathered on the Nov92 task demonstrates the viability of the proposed technique.

Index Terms: Connectionist speech recognition systems, Neural networks, Adaptation algorithms, Speech recognition
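
The idea of a hidden activation function learned from adaptation data can be illustrated with a minimal sketch. It assumes the activation is a weighted sum of the first few orthonormal Hermite functions (Hermite polynomials with Gaussian weighting), and that only the expansion coefficients are updated during adaptation while all other weights stay frozen. The function names, the squared-error objective, and the plain gradient-descent loop are illustrative assumptions, not the formulation or training recipe used in the paper.

```python
import numpy as np


def hermite_basis(x, order):
    """Evaluate the orthonormal Hermite functions h_0..h_{order-1} at x.

    Uses the standard three-term recurrence; requires order >= 1.
    Returns an array of shape x.shape + (order,).
    """
    x = np.asarray(x, dtype=float)
    basis = np.empty(x.shape + (order,))
    basis[..., 0] = np.pi ** -0.25 * np.exp(-0.5 * x ** 2)
    if order > 1:
        basis[..., 1] = np.sqrt(2.0) * x * basis[..., 0]
    for n in range(1, order - 1):
        basis[..., n + 1] = (np.sqrt(2.0 / (n + 1)) * x * basis[..., n]
                             - np.sqrt(n / (n + 1)) * basis[..., n - 1])
    return basis


def hermite_activation(x, coeffs):
    """Hypothetical adaptive activation: sum_k coeffs[k] * h_k(x)."""
    coeffs = np.asarray(coeffs, dtype=float)
    return hermite_basis(x, len(coeffs)) @ coeffs


def adapt_coeffs(pre_act, target, coeffs, lr=0.1, steps=200):
    """Toy adaptation: gradient descent on mean squared error.

    Only the activation coefficients change; the pre-activations (and so
    the network weights that produced them) are treated as frozen.
    """
    basis = hermite_basis(pre_act, len(coeffs))  # shape (N, order)
    c = np.array(coeffs, dtype=float)
    for _ in range(steps):
        err = basis @ c - target
        c -= lr * 2.0 * basis.T @ err / len(target)
    return c


# Usage: adapt four coefficients on synthetic "adaptation data".
rng = np.random.default_rng(0)
pre_act = rng.normal(size=200)      # frozen hidden pre-activations
target = np.tanh(pre_act)           # stand-in for the desired output
adapted = adapt_coeffs(pre_act, target, np.zeros(4))
```

In this sketch the number of adaptation parameters is the number of coefficients per activation, so it does not depend on the input or output layer dimension, and the network structure stays the same before and after adaptation.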

Bibliographic reference. Siniscalchi, Sabato Marco / Li, Jinyu / Lee, Chin-Hui (2012): "Hermitian based hidden activation functions for adaptation of hybrid HMM/ANN models", in INTERSPEECH-2012, 2590-2593.