In this paper we present advances in modeling the masking behavior of the Human Auditory System to enhance the robustness of the feature extraction stage in Automatic Speech Recognition. The adopted solution is a non-linear filtering of a spectro-temporal representation, applied simultaneously in the frequency and time domains by processing the representation with mathematical morphology operations as if it were an image. A particularly important component of this architecture is the so-called structuring element: in the present contribution, biologically-based considerations are used to design an element that closely resembles the masking phenomena taking place in the cochlea. The second feature of this contribution is the choice of the underlying spectro-temporal representation. The best results were achieved with the representation introduced as part of the Power Normalized Cepstral Coefficients (PNCC), combined with a spectral subtraction step. On the Aurora 2 noisy continuous digits task, we report relative error reductions of 18.7% compared to PNCC and 39.5% compared to Mel-Frequency Cepstral Coefficients (MFCC).
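To make the idea of treating a spectro-temporal representation as an image more concrete, the following is a minimal sketch of a grey-scale morphological opening (erosion followed by dilation) over a toy spectrogram. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not say which morphological operator is used, and the structuring element `MASK_OFFSETS`, the function names, and the toy values are all hypothetical.

```python
# Hypothetical flat structuring element, given as (frequency, time) offsets
# from the origin. It reaches further forward in time than backward, a loose
# nod to forward masking. This shape is an assumption for illustration only,
# not the biologically designed element described in the paper.
MASK_OFFSETS = [(0, 0), (0, 1), (0, 2), (-1, 0), (1, 0)]


def erode(spec, offsets):
    """Grey-scale erosion: each cell becomes the minimum over its neighbourhood."""
    n_freq, n_time = len(spec), len(spec[0])
    out = []
    for f in range(n_freq):
        row = []
        for t in range(n_time):
            # Neighbours falling outside the spectrogram are ignored.
            vals = [spec[f + df][t + dt] for df, dt in offsets
                    if 0 <= f + df < n_freq and 0 <= t + dt < n_time]
            row.append(min(vals))
        out.append(row)
    return out


def dilate(spec, offsets):
    """Grey-scale dilation with the reflected element: neighbourhood maximum."""
    n_freq, n_time = len(spec), len(spec[0])
    out = []
    for f in range(n_freq):
        row = []
        for t in range(n_time):
            vals = [spec[f - df][t - dt] for df, dt in offsets
                    if 0 <= f - df < n_freq and 0 <= t - dt < n_time]
            row.append(max(vals))
        out.append(row)
    return out


def opening(spec, offsets):
    """Morphological opening: removes peaks too small to contain the element."""
    return dilate(erode(spec, offsets), offsets)


# Toy spectrogram (5 frequency bands x 8 frames): a broad energy region
# in the top-left corner plus one isolated spike standing in for noise.
toy = [[0.0] * 8 for _ in range(5)]
for f in range(3):
    for t in range(3):
        toy[f][t] = 5.0
toy[3][6] = 9.0

opened = opening(toy, MASK_OFFSETS)
```

In this toy case the isolated spike is suppressed because the element does not fit inside it, while the part of the broad region that the element does fit inside is preserved. Changing the shape of the offsets changes which spectro-temporal patterns survive, which is why the design of the structuring element matters.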
Bibliographic reference. de-la-Calle-Silos, F. / Valverde-Albacete, F. J. / Gallardo-Antolín, A. / Peláez-Moreno, C. (2014): "ASR feature extraction with morphologically-filtered power-normalized cochleograms", In INTERSPEECH-2014, 2430-2434.