Weighting Time-Frequency Representation of Speech Using Auditory Saliency for Automatic Speech Recognition

Cong-Thanh Do, Yannis Stylianou


This paper proposes a new method for weighting the two-dimensional (2D) time-frequency (T-F) representation of speech using auditory saliency for noise-robust automatic speech recognition (ASR). Auditory saliency is estimated via 2D auditory saliency maps, which model the mechanism that allocates human auditory attention. These maps are used to weight the T-F representation of speech, namely the 2D magnitude spectrum or spectrogram, prior to feature extraction for ASR. Experiments on the Aurora-4 corpus demonstrate the effectiveness of the proposed method for noise-robust ASR. In multi-stream ASR, relative word error rate (WER) reductions of up to 5.3% and 4.0% are observed when the multi-stream system using the proposed method is compared with the baseline single-stream system without T-F representation weighting and with a single-stream system using a conventional spectral-masking noise-robustness technique, respectively. Combining the multi-stream system using the proposed method with the single-stream system using the conventional spectral-masking technique further reduces the WER.
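The pipeline described above can be sketched in a few lines: compute a magnitude spectrogram, derive a saliency map over the same T-F grid, and multiply the two before feature extraction. The sketch below is a minimal illustration, not the paper's method: the saliency map here is a crude center-surround contrast of the log spectrogram (a stand-in for a true auditory saliency model), and the affine weighting with an `alpha` floor is an assumed, illustrative choice.

```python
import numpy as np

def stft_magnitude(x, n_fft=256, hop=128):
    """Magnitude spectrogram (freq x time) via a framed FFT with a Hann window."""
    win = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * win for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)).T

def box_filter(a, k):
    """Separable 2D moving-average filter with same-size output."""
    kern = np.ones(k) / k
    a = np.apply_along_axis(np.convolve, 0, a, kern, mode='same')
    a = np.apply_along_axis(np.convolve, 1, a, kern, mode='same')
    return a

def crude_saliency(spec, surround=9):
    """Center-surround contrast of the log spectrogram, rescaled to [0, 1].
    NOTE: a crude stand-in for an auditory saliency model, not the paper's."""
    log_spec = np.log1p(spec)
    contrast = np.maximum(log_spec - box_filter(log_spec, surround), 0.0)
    return contrast / (contrast.max() + 1e-12)

def saliency_weighted_spectrogram(x, alpha=0.3):
    """Weight the magnitude spectrogram by saliency before feature extraction.
    The affine weighting (floor at alpha) is an illustrative assumption."""
    spec = stft_magnitude(x)
    sal = crude_saliency(spec)
    return spec * (alpha + (1.0 - alpha) * sal), sal

# Toy usage: an 800 Hz tone in light noise at 8 kHz sampling rate.
rng = np.random.default_rng(0)
t = np.arange(4096) / 8000.0
x = np.sin(2 * np.pi * 800 * t) + 0.1 * rng.standard_normal(4096)
weighted, sal = saliency_weighted_spectrogram(x)
```

Salient T-F regions (here, the tonal peak standing out from its surround) keep their magnitude nearly intact, while low-saliency regions are attenuated toward the `alpha` floor; standard ASR features would then be extracted from `weighted` instead of the raw spectrogram.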


 DOI: 10.21437/Interspeech.2018-1721

Cite as: Do, C., Stylianou, Y. (2018) Weighting Time-Frequency Representation of Speech Using Auditory Saliency for Automatic Speech Recognition. Proc. Interspeech 2018, 1591-1595, DOI: 10.21437/Interspeech.2018-1721.


@inproceedings{Do2018,
  author={Cong-Thanh Do and Yannis Stylianou},
  title={Weighting Time-Frequency Representation of Speech Using Auditory Saliency for Automatic Speech Recognition},
  year={2018},
  booktitle={Proc. Interspeech 2018},
  pages={1591--1595},
  doi={10.21437/Interspeech.2018-1721},
  url={http://dx.doi.org/10.21437/Interspeech.2018-1721}
}