Comparing human speech recognition (HSR) with automatic speech recognition (ASR) makes it possible to learn from the differences between the two and motivates the use of auditory-inspired strategies in ASR. The recognition of noisy digit strings from the Aurora 2 framework is one of the most widely used tasks in the ASR community. This paper establishes a baseline with a close-to-optimal classifier, i.e., our auditory system, by comparing results from 10 normal-hearing listeners to the Aurora 2 reference system using identical speech material. The baseline ASR system reaches the human performance level only when the signal-to-noise ratio is increased by 10 or 21 dB, depending on the training condition. The recognition of 1-digit recordings was found to be considerably better for HSR, indicating that onset detection is an important feature neglected in standard ASR systems. Results of recent studies are considered in the light of these findings to measure how far we have come on the way to human speech recognition performance.
Bibliographic reference. Meyer, Bernd T. (2013): "What's the difference? Comparing humans and machines on the Aurora 2 speech recognition task", in INTERSPEECH-2013, 2634-2638.