DNN-Based Automatic Speech Recognition as a Model for Human Phoneme Perception

Mats Exter, Bernd T. Meyer


In this paper, we test the applicability of state-of-the-art automatic speech recognition (ASR) to predicting phoneme confusions in human listeners. Phoneme-specific response rates are obtained from ASR based on deep neural networks (DNNs) and from listening tests with six normal-hearing subjects. The measure of model quality is the correlation between phoneme recognition accuracies obtained in ASR and in human speech recognition (HSR). Various feature representations are used as input to the DNNs to explore their relation to overall ASR performance and to the model's predictive power. Standard filterbank output and perceptual linear prediction (PLP) features yield the best predictions, with correlation coefficients reaching r = 0.9.
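The model-quality measure described above is a Pearson correlation between per-phoneme accuracies from the ASR system and from the listening tests. A minimal sketch of that computation is below; the accuracy values are illustrative placeholders, not data from the paper.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical per-phoneme recognition accuracies (made-up numbers,
# one entry per phoneme) for the ASR system and the human listeners (HSR).
asr_acc = [0.82, 0.64, 0.91, 0.55, 0.73]
hsr_acc = [0.88, 0.70, 0.95, 0.60, 0.77]

print(round(pearson_r(asr_acc, hsr_acc), 3))
```

A correlation near 1 would indicate that phonemes which are hard for the ASR system are also hard for human listeners, which is the sense in which the DNN serves as a perceptual model here.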


DOI: 10.21437/Interspeech.2016-1285

Cite as:

Exter, M., Meyer, B.T. (2016) DNN-Based Automatic Speech Recognition as a Model for Human Phoneme Perception. Proc. Interspeech 2016, 615-619.

Bibtex
@inproceedings{Exter+2016,
  author={Mats Exter and Bernd T. Meyer},
  title={{DNN}-Based Automatic Speech Recognition as a Model for Human Phoneme Perception},
  year={2016},
  booktitle={Interspeech 2016},
  doi={10.21437/Interspeech.2016-1285},
  url={http://dx.doi.org/10.21437/Interspeech.2016-1285},
  pages={615--619}
}