10th Annual Conference of the International Speech Communication Association

Brighton, United Kingdom
September 6-10, 2009

Towards Fusion of Feature Extraction and Acoustic Model Training: A Top Down Process for Robust Speech Recognition

Yu-Hsiang Bosco Chiu, Bhiksha Raj, Richard M. Stern

Carnegie Mellon University, USA

This paper presents a strategy for discriminatively learning physiologically motivated components of a feature computation module directly from data, in a manner inspired by the efferent processes of the human auditory system. In our model, a set of logistic functions representing the rate-level nonlinearities found in most mammalian auditory systems is incorporated into the feature extraction process. The parameters of these rate-level functions are estimated to maximize the a posteriori probability of the correct class in the training data. The estimated feature computation is observed to be robust against environmental noise. Experiments conducted with the CMU Sphinx-III system on the DARPA Resource Management task show that the discriminatively estimated rate-level nonlinearity yields better performance in the presence of background noise than traditional procedures, which separate feature extraction and model training into two distinct stages with no feedback from the latter to the former.
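The idea of tuning a logistic rate-level function against a posterior-probability objective can be illustrated with a small sketch. The abstract does not give the parameterization or the training procedure, so everything here is an assumption made for illustration: the three parameters (gain, slope, threshold), the fixed linear softmax classifier standing in for the acoustic model, the finite-difference gradient, and the toy data are all hypothetical and are not the authors' method.

```python
import math

def rate_level(x, gain, slope, threshold):
    """Logistic rate-level nonlinearity (assumed form): a saturating map
    from an input level x to an output rate between 0 and gain."""
    return gain / (1.0 + math.exp(-slope * (x - threshold)))

def log_posterior(features, label, weights):
    """Log softmax posterior of `label` under a fixed linear classifier.
    The classifier is a hypothetical stand-in for the acoustic model."""
    scores = [sum(w * f for w, f in zip(row, features)) for row in weights]
    m = max(scores)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return scores[label] - log_z

def objective(params, data, weights):
    """Sum over the data of the log posterior of the correct class,
    with features passed through the rate-level nonlinearity."""
    gain, slope, threshold = params
    total = 0.0
    for levels, label in data:
        feats = [rate_level(x, gain, slope, threshold) for x in levels]
        total += log_posterior(feats, label, weights)
    return total

def gradient_ascent_step(params, data, weights, lr=0.1, eps=1e-5):
    """One ascent step using central finite-difference gradients
    (chosen for brevity; an analytic gradient would normally be used)."""
    grad = []
    for i in range(len(params)):
        up = list(params)
        up[i] += eps
        down = list(params)
        down[i] -= eps
        diff = objective(up, data, weights) - objective(down, data, weights)
        grad.append(diff / (2.0 * eps))
    return [p + lr * g for p, g in zip(params, grad)]

# Toy two-channel, two-class data (hypothetical values).
toy_data = [((2.0, 0.5), 0), ((0.5, 2.0), 1)]
toy_weights = [[1.0, -1.0], [-1.0, 1.0]]
start = [1.0, 1.0, 1.0]  # gain, slope, threshold
updated = gradient_ascent_step(start, toy_data, toy_weights)
```

In this toy setting the classifier is held fixed and only the nonlinearity moves, so the update shows the top-down direction of the feedback: the objective is defined at the classifier output, and its gradient reshapes the feature computation.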


Bibliographic reference. Chiu, Yu-Hsiang Bosco / Raj, Bhiksha / Stern, Richard M. (2009): "Towards fusion of feature extraction and acoustic model training: a top down process for robust speech recognition", in INTERSPEECH-2009, 32-35.