INTERSPEECH 2009
10th Annual Conference of the International Speech Communication Association

Brighton, United Kingdom
September 6-10, 2009

Noise Robustness of Tract Variables and their Application to Speech Recognition

Vikramjit Mitra (1), Hosung Nam (2), Carol Y. Espy-Wilson (1), Elliot Saltzman (2), Louis Goldstein (3)

(1) University of Maryland at College Park, USA
(2) Haskins Laboratories, USA
(3) University of Southern California, USA

This paper analyzes the noise robustness of vocal tract constriction variable estimation and investigates the role of the estimated variables in noise-robust speech recognition. We implemented a simple direct inverse model using a feed-forward artificial neural network to estimate vocal tract variables (TVs) from the speech signal. We first trained the model on clean synthetic speech and then tested its noise robustness on noise-corrupted speech. The training corpus was obtained from the TAsk Dynamics Application model (TADA [1]), which generated the synthetic speech as well as the corresponding TVs. Eight vocal tract constriction variables were considered in this study: five constriction degree variables (lip aperture [LA], tongue body [TBCD], tongue tip [TTCD], velum [VEL], and glottis [GLO]) and three constriction location variables (lip protrusion [LP], tongue tip [TTCL], and tongue body [TBCL]). We also explored using a modified phase opponency (MPO) [2] speech enhancement technique as a preprocessor for TV estimation to observe its effect on noise robustness. Kalman smoothing was applied to the estimated TVs to reduce the estimation noise. Finally, the TV estimation module was tested on naturally produced speech contaminated with noise at different signal-to-noise ratios. The TVs estimated from the natural speech corpus were then used in conjunction with the baseline features to perform automatic speech recognition (ASR) experiments. Results show average improvements of 22% and 21%, relative to the baseline, in ASR performance on the Aurora-2 dataset with car and subway noise, respectively. The TVs in these experiments were estimated from the MPO-enhanced speech.
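
Two steps mentioned in the abstract, corrupting a signal with noise at a chosen signal-to-noise ratio and Kalman smoothing of an estimated trajectory, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the Kalman state model, so a scalar random-walk model with a Rauch-Tung-Striebel backward pass is assumed here, and the function names `add_noise_at_snr` and `kalman_smooth` as well as the variances `q` and `r` are hypothetical choices made for illustration only.

```python
import numpy as np


def add_noise_at_snr(clean, noise, snr_db):
    """Scale `noise` so that clean + noise has the requested SNR in dB."""
    clean = np.asarray(clean, dtype=float)
    noise = np.asarray(noise, dtype=float)[: len(clean)]
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    # Choose the gain so that p_clean / (gain^2 * p_noise) = 10^(snr_db / 10).
    gain = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return clean + gain * noise


def kalman_smooth(z, q=1e-4, r=1e-2):
    """Smooth a 1-D trajectory with a scalar Kalman filter and RTS smoother.

    Assumed model (not taken from the paper):
        x[t] = x[t-1] + w,  w ~ N(0, q)   (random-walk state)
        z[t] = x[t]   + v,  v ~ N(0, r)   (noisy estimate)
    """
    z = np.asarray(z, dtype=float)
    n = len(z)
    x_pred = np.zeros(n)
    p_pred = np.zeros(n)
    x_filt = np.zeros(n)
    p_filt = np.zeros(n)

    x, p = z[0], r  # initialise the state from the first observation
    for t in range(n):
        # Predict step: random walk keeps the mean, inflates the variance.
        x_pred[t], p_pred[t] = x, p + q
        # Update step: blend prediction and observation via the Kalman gain.
        k = p_pred[t] / (p_pred[t] + r)
        x = x_pred[t] + k * (z[t] - x_pred[t])
        p = (1.0 - k) * p_pred[t]
        x_filt[t], p_filt[t] = x, p

    # Backward (Rauch-Tung-Striebel) pass refines each estimate using the future.
    x_smooth = x_filt.copy()
    for t in range(n - 2, -1, -1):
        g = p_filt[t] / p_pred[t + 1]
        x_smooth[t] = x_filt[t] + g * (x_smooth[t + 1] - x_pred[t + 1])
    return x_smooth
```

In this sketch each TV trajectory would be smoothed independently, with `r` set to roughly the variance of the estimation noise and `q` controlling how quickly the smoothed trajectory is allowed to move.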


Bibliographic reference. Mitra, Vikramjit / Nam, Hosung / Espy-Wilson, Carol Y. / Saltzman, Elliot / Goldstein, Louis (2009): "Noise robustness of tract variables and their application to speech recognition", in INTERSPEECH-2009, 2759-2762.