We investigated a single-ended speech intelligibility estimation method that does not require a clean speech reference signal, using the features defined in the ITU-T standard P.563. We selected two feature sets from the P.563 features: a basic set of nine features, and an extended set of 31 features that adds 22 features for a more accurate description of the degraded speech. Four hundred noise samples were added to speech, and about 70% of these samples were used to extract the feature sets and train support vector regression (SVR) models. The trained models were then used to estimate the intelligibility of speech degraded with the remaining 30% of the noise samples, which were unknown to the models. With either feature set, the proposed method showed a root mean square error (RMSE) of about 0.16 and a correlation with subjective intelligibility of about 0.84 for speech distorted with unknown noise. This accuracy was higher than that of double-ended estimation using frequency-weighted SNR calculated in critical frequency bands, which requires the clean reference signal. We believe this level of accuracy shows that the proposed method is applicable to real-time speech quality monitoring in the field.
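The evaluation protocol described above (holding out about 30% of the noise samples as unknown noise, then scoring estimates by RMSE and correlation) can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names, the random seed, and the use of a shuffled index split and Pearson correlation are assumptions, and the P.563 feature extraction and the SVR model themselves are not shown.

```python
import math
import random


def split_noise_ids(n_noise=400, train_fraction=0.7, seed=0):
    """Split noise-sample indices so held-out noises never appear in training.

    Hypothetical helper: the abstract only states an approximate 70/30 split.
    """
    ids = list(range(n_noise))
    random.Random(seed).shuffle(ids)
    n_train = round(n_noise * train_fraction)
    return ids[:n_train], ids[n_train:]


def rmse(predicted, observed):
    """Root mean square error between estimated and subjective scores."""
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(squared))


def pearson(predicted, observed):
    """Pearson correlation coefficient (assumed here as the correlation measure)."""
    n = len(observed)
    mean_p = sum(predicted) / n
    mean_o = sum(observed) / n
    cov = sum((p - mean_p) * (o - mean_o) for p, o in zip(predicted, observed))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_o = sum((o - mean_o) ** 2 for o in observed)
    return cov / math.sqrt(var_p * var_o)


if __name__ == "__main__":
    train_ids, test_ids = split_noise_ids()
    print(len(train_ids), len(test_ids))

    # Made-up intelligibility scores in [0, 1], for illustration only.
    estimated = [0.20, 0.55, 0.70, 0.90]
    subjective = [0.30, 0.50, 0.65, 0.95]
    print(rmse(estimated, subjective), pearson(estimated, subjective))
```

Splitting by noise sample rather than by individual utterance is what makes the held-out noise "unknown" to the trained model, which is the condition under which the reported figures were obtained.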
Bibliographic reference. Sakano, Toshihiro / Kobayashi, Yosuke / Kondo, Kazuhiro (2014): "Single-ended estimation of speech intelligibility using the ITU p.563 feature set", In INTERSPEECH-2014, 2031-2035.