Predicting Speech Intelligibility of Enhanced Speech Using Phone Accuracy of DNN-Based ASR System

Kenichi Arai, Shoko Araki, Atsunori Ogawa, Keisuke Kinoshita, Tomohiro Nakatani, Katsuhiko Yamamoto, Toshio Irino


The ability of state-of-the-art automatic speech recognition (ASR) systems, which use deep neural networks (DNNs), has recently been approaching that of the human auditory system. Meanwhile, although measuring the intelligibility of enhanced speech signals is important for developing auditory algorithms and devices, current measurement methods rely mainly on subjective experiments. It would therefore be preferable to employ an ASR system to predict the subjective speech intelligibility (SI) of enhanced speech. In this study, we evaluate the SI prediction performance of DNN-based ASR systems using phone accuracies. We found that an ASR system with multi-condition training achieves the best SI prediction accuracy for enhanced speech when compared with conventional methods (STOI, HASPI) and a recently proposed technique (GEDI). In addition, since our ASR system uses only a phone language model, it has the advantage of predicting intelligibility independently of prior knowledge of words.
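The abstract does not say how phone accuracy is scored. As a minimal sketch, assume the common definition (N - S - D - I) / N, where N is the number of reference phones and S, D, and I are the substitutions, deletions, and insertions in a minimum edit-distance alignment. The function name `phone_accuracy` and the example phone labels are hypothetical and not taken from the paper.

```python
def phone_accuracy(reference, hypothesis):
    """Phone accuracy (N - S - D - I) / N from a minimum edit-distance alignment.

    Assumption: this is the common ASR scoring definition, not necessarily
    the exact scoring used by the authors.
    """
    n, m = len(reference), len(hypothesis)
    if n == 0:
        raise ValueError("reference must contain at least one phone")

    # dist[i][j] = minimum number of edits turning reference[:i] into hypothesis[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i  # i deletions
    for j in range(1, m + 1):
        dist[0][j] = j  # j insertions

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            mismatch = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j - 1] + mismatch,  # match or substitution
                dist[i - 1][j] + 1,             # deletion
                dist[i][j - 1] + 1,             # insertion
            )

    # S + D + I equals the minimum edit distance
    return (n - dist[n][m]) / n


# Hypothetical phone sequences, for illustration only
ref = ["k", "a", "t", "o"]
hyp = ["k", "a", "d", "o"]
print(phone_accuracy(ref, hyp))
```

Because insertions are penalized, this measure can fall below zero when the recognizer outputs many spurious phones. In an SI prediction setting, such a score would be averaged over utterances per condition and then compared with listener scores.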


DOI: 10.21437/Interspeech.2019-1381

Cite as: Arai, K., Araki, S., Ogawa, A., Kinoshita, K., Nakatani, T., Yamamoto, K., Irino, T. (2019) Predicting Speech Intelligibility of Enhanced Speech Using Phone Accuracy of DNN-Based ASR System. Proc. Interspeech 2019, 4275-4279, DOI: 10.21437/Interspeech.2019-1381.


@inproceedings{Arai2019,
  author={Kenichi Arai and Shoko Araki and Atsunori Ogawa and Keisuke Kinoshita and Tomohiro Nakatani and Katsuhiko Yamamoto and Toshio Irino},
  title={{Predicting Speech Intelligibility of Enhanced Speech Using Phone Accuracy of DNN-Based ASR System}},
  year=2019,
  booktitle={Proc. Interspeech 2019},
  pages={4275--4279},
  doi={10.21437/Interspeech.2019-1381},
  url={http://dx.doi.org/10.21437/Interspeech.2019-1381}
}