12th Annual Conference of the International Speech Communication Association

Florence, Italy
August 27-31, 2011

Feature Normalization Using Structured Full Transforms for Robust Speech Recognition

Xiong Xiao (1), Jinyu Li (2), Eng Siong Chng (1), Haizhou Li (3)

(1) Nanyang Technological University, Singapore
(2) Microsoft Corporation, USA
(3) A*STAR, Singapore

Classical mean and variance normalization (MVN) uses a diagonal transform and a bias vector to normalize the mean and variance of noisy features to reference values. Because MVN uses a diagonal transform, it ignores correlation between feature dimensions. Although a full transform can exploit feature correlation, its large number of parameters may not be estimated reliably from a short observation, e.g. one utterance. We propose a novel structured full transform that has the same number of free parameters as a diagonal transform while being able to capture correlation between feature dimensions. The proposed structured transform can be estimated reliably from one utterance by maximizing the likelihood of the normalized features on a reference Gaussian mixture model. Experimental results on the Aurora-4 task show that the structured transform produces consistently better speech recognition results than the diagonal transform and also outperforms the advanced front-end (AFE) feature extractor.
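
As an illustration of the diagonal baseline only, per-utterance MVN can be sketched as below. This is a minimal sketch and not the authors' code: the function name `mvn`, the list-of-frames input layout, and the default reference values (zero mean, unit standard deviation) are assumptions made for the example. The structured full transform itself is not sketched, since the abstract does not specify its structure.

```python
from statistics import fmean, pstdev


def mvn(frames, ref_mean=0.0, ref_std=1.0):
    """Per-utterance MVN (illustrative): y[t][d] = a[d] * x[t][d] + b[d].

    frames: list of feature vectors, one per time frame (assumed layout).
    ref_mean, ref_std: assumed reference values shared by all dimensions.
    """
    scale, bias = [], []
    # zip(*frames) yields one tuple of values per feature dimension.
    for values in zip(*frames):
        mu = fmean(values)
        sigma = pstdev(values, mu)
        # Diagonal entry of the transform; guard against a constant dimension.
        a = ref_std / sigma if sigma > 0 else 1.0
        scale.append(a)
        bias.append(ref_mean - a * mu)
    # Each dimension is scaled and shifted on its own, so any correlation
    # between dimensions is left untouched.
    return [[a * x + b for x, a, b in zip(frame, scale, bias)]
            for frame in frames]


if __name__ == "__main__":
    utterance = [[1.0, 10.0], [2.0, 20.0], [3.0, 40.0]]
    for frame in mvn(utterance):
        print(frame)
```

Because the scale and bias are computed separately for each dimension, the number of free parameters grows linearly with the feature dimension, which is why they can be estimated from a single utterance.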


Bibliographic reference. Xiao, Xiong / Li, Jinyu / Chng, Eng Siong / Li, Haizhou (2011): "Feature normalization using structured full transforms for robust speech recognition", in INTERSPEECH-2011, 693-696.