Word error rate or character error rate are usually used as the metrics for evaluating the accuracy of speech recognition. These are naturally-defined objective metrics and are helpful for comparing recognition methods fairly. However the overall performance of the recognition systems and the usefulness of the results are not necessarily considered. To address this problem, we study and propose a metric which replicates human-annotated scores using their perception to the recognition results. The features that we use are the numbers of insertion errors, deletion errors, and substitution errors in the characters and the syllables. In addition we studied the numbers of consecutive errors, the misrecognized keywords, and the locations of errors. We created models using linear regression and random forest, predicted human-perceived scores, and compared them with the actual scores using Spearman's rank-based correlation. According to our experiments the correlation of human perceived scores with character error rates is 0.456, while those with the predicted scores by using a random forest of 10 features is 0.715. The latter is close to the averaged correlation between the scores of the human subjects, 0.765, which suggests that we can predict the human-perceived scores using those features and that we can leverage human perception model for evaluating speech recognition performance. The important factors (features) for the prediction are the numbers of substitution errors and consecutive errors.
Bibliographic reference. Itoh, Nobuyasu / Kurata, Gakuto / Tachibana, Ryuki / Nishimura, Masafumi (2015): "A metric for evaluating speech recognizer output based on human-perception model", In INTERSPEECH-2015, 1285-1288.