In this paper we develop a physiologically motivated model of peripheral auditory processing and evaluate how the different processing steps influence automatic speech recognition in noise. The model features large dynamic compression (>60 dB) and a realistic sensory cell model. The compression range was well matched to the limited dynamic range of the sensory cells, and the model yielded surprisingly high recognition scores. We also developed a computationally efficient, simplified model of auditory processing and found that a model of adaptation could improve recognition accuracy. Adaptation is a basic principle of neuronal processing that accentuates signal onsets. Applying this adaptation model to mel-frequency cepstral coefficient (MFCC) feature extraction raised recognition accuracy in noise (AURORA 2 task, averaged recognition scores) from 56.4% to 75.6% (clean training condition), a relative improvement of 41% in word error rate. Adaptation outperformed RASTA processing by more than 10 percentage points, which corresponds to a relative improvement of 31%.
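To illustrate what "accentuates signal onsets" means, here is a minimal sketch in Python. It is not the adaptation model from the paper, whose details the abstract does not give. It assumes a generic first-order scheme in which a leaky running average of the input is subtracted from the input itself; the function name `adapt` and the parameter `alpha` are hypothetical and chosen only for illustration.

```python
def adapt(envelope, alpha=0.9):
    """Toy adaptation: subtract a leaky running average from each sample.

    Assumption: a first-order high-pass behaviour stands in for neuronal
    adaptation here; alpha (hypothetical) sets how slowly the state
    follows the input.
    """
    state = envelope[0]  # start fully adapted to the initial level
    out = []
    for x in envelope:
        out.append(x - state)                      # response relative to adapted level
        state = alpha * state + (1.0 - alpha) * x  # state slowly tracks the input
    return out


# A step in a feature envelope: silence followed by a sustained sound.
step = [0.0] * 5 + [1.0] * 20
response = adapt(step)
```

In this toy setting the response is largest at the onset of the step and then decays toward zero while the input stays constant, which is the qualitative behaviour the abstract attributes to adaptation.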
Cite as: Hemmert, W., Holmberg, M., Gelbart, D. (2004) Auditory-based automatic speech recognition. Proc. ITRW on Statistical and Perceptual Audio Processing (SAPA 2004), paper 74
@inproceedings{hemmert04_sapa,
  author={Werner Hemmert and Marcus Holmberg and David Gelbart},
  title={{Auditory-based automatic speech recognition}},
  year=2004,
  booktitle={Proc. ITRW on Statistical and Perceptual Audio Processing (SAPA 2004)},
  pages={paper 74}
}