Listening in the Dips: Comparing Relevant Features for Speech Recognition in Humans and Machines

Constantin Spille, Bernd T. Meyer


In recent years, automatic speech recognition (ASR) systems have gradually narrowed (and for some tasks closed) the gap between human and automatic speech recognition. However, it is unclear whether similar performance implies that humans and ASR systems rely on similar signal cues. In the current study, ASR and human speech recognition (HSR) are compared using speech material from a matrix sentence test mixed with either a stationary speech-shaped noise (SSN) or amplitude-modulated SSN. Recognition performance of HSR and ASR is measured in terms of the speech recognition threshold (SRT), i.e., the signal-to-noise ratio at which a 50% recognition rate is achieved, and by comparing psychometric functions. ASR results are obtained with matched-trained DNN-based systems that use FBank features as input and compared to results obtained from eight normal-hearing listeners and two established models of speech intelligibility. For both maskers, HSR and ASR achieve similar SRTs, with an average deviation of only 0.4 dB. A relevance propagation algorithm is applied to identify features relevant for ASR. The analysis shows that relevant features coincide either with spectral peaks of the speech signal or with dips of the noise masker, indicating that similar cues are important in HSR and ASR.
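The SRT described above, i.e., the SNR yielding a 50% recognition rate, is typically estimated by fitting a psychometric function to recognition rates measured at several SNRs. A minimal sketch of this idea using a logistic function and hypothetical measurement data (all values and the function choice are illustrative, not taken from the paper):

```python
import numpy as np
from scipy.optimize import curve_fit

def psychometric(snr, srt, slope):
    # Logistic psychometric function: recognition rate is 0.5 when snr == srt
    return 1.0 / (1.0 + np.exp(-slope * (snr - srt)))

# Hypothetical word recognition rates measured at several SNRs (dB)
snrs = np.array([-12.0, -9.0, -6.0, -3.0, 0.0])
rates = np.array([0.10, 0.25, 0.55, 0.85, 0.97])

# Fit SRT and slope; the SRT is the SNR at the 50% point of the fitted curve
(srt, slope), _ = curve_fit(psychometric, snrs, rates, p0=[-6.0, 1.0])
print(f"Estimated SRT: {srt:.1f} dB SNR")
```

Comparing HSR and ASR then amounts to running the same fit on human and machine recognition scores and comparing the resulting SRTs (and slopes of the psychometric functions).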


DOI: 10.21437/Interspeech.2017-1168

Cite as: Spille, C., Meyer, B.T. (2017) Listening in the Dips: Comparing Relevant Features for Speech Recognition in Humans and Machines. Proc. Interspeech 2017, 2968-2972, DOI: 10.21437/Interspeech.2017-1168.


@inproceedings{Spille2017,
  author={Constantin Spille and Bernd T. Meyer},
  title={Listening in the Dips: Comparing Relevant Features for Speech Recognition in Humans and Machines},
  year={2017},
  booktitle={Proc. Interspeech 2017},
  pages={2968--2972},
  doi={10.21437/Interspeech.2017-1168},
  url={http://dx.doi.org/10.21437/Interspeech.2017-1168}
}