15th Annual Conference of the International Speech Communication Association

September 14-18, 2014

Identifying the Human-Machine Differences in Complex Binaural Scenes: What Can Be Learned from Our Auditory System

Constantin Spille, Bernd T. Meyer

Carl von Ossietzky Universität Oldenburg, Germany

Previous comparisons of human speech recognition (HSR) and automatic speech recognition (ASR) focused on monaural signals in additive noise and showed that HSR is far more robust against intrinsic and extrinsic sources of variation than conventional ASR. The aim of this study is to analyze the man-machine gap (and its causes) in more complex acoustic scenarios, particularly in scenes with two moving speakers, reverberation, and diffuse noise. Responses of nine normal-hearing listeners are compared to errors of an ASR system that employs a binaural model for direction-of-arrival estimation and beamforming for signal enhancement. The overall man-machine gap is measured in terms of the speech recognition threshold (SRT), i.e., the signal-to-noise ratio at which a 50% recognition rate is obtained. The comparison shows that the gap amounts to a 16.7 dB SRT difference, which exceeds the 10 dB difference found in monaural situations. Based on cross-comparisons that use oracle knowledge (e.g., the speakers' true positions), incorrect responses are attributed to localization errors (7 dB) or to missing spectral information for distinguishing between speakers of different gender (3 dB). The comparison hence identifies specific ASR components that can profit from binaural auditory signal processing.
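The SRT defined above is the SNR at which a 50% recognition rate is reached. A minimal sketch of how such a threshold can be read off a measured psychometric function by linear interpolation is shown below; the data points are illustrative only and do not come from the paper.

```python
# Sketch: estimate the speech recognition threshold (SRT), i.e. the SNR
# at which the recognition rate crosses 50%, by linear interpolation
# between adjacent (SNR, recognition-rate) measurement points.
# Assumption: rates are monotonically increasing with SNR.

def estimate_srt(snrs, rates, target=0.5):
    """Return the SNR (dB) at which the recognition rate crosses `target`."""
    pairs = sorted(zip(snrs, rates))
    for (s0, r0), (s1, r1) in zip(pairs, pairs[1:]):
        if r0 <= target <= r1:           # crossing lies between these points
            if r1 == r0:                 # flat segment: return its left edge
                return s0
            return s0 + (target - r0) * (s1 - s0) / (r1 - r0)
    raise ValueError("target rate not bracketed by the measurements")

# Illustrative psychometric data (SNR in dB, proportion correct):
snrs = [-12, -9, -6, -3, 0]
rates = [0.05, 0.20, 0.45, 0.75, 0.95]
print(estimate_srt(snrs, rates))  # → -5.5
```

With these example points, the 50% crossing falls between -6 dB (45% correct) and -3 dB (75% correct), yielding an SRT of -5.5 dB.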

Full Paper

Bibliographic reference. Spille, Constantin / Meyer, Bernd T. (2014): "Identifying the human-machine differences in complex binaural scenes: what can be learned from our auditory system", In INTERSPEECH-2014, 626-630.