This paper studies the relationship between word error rate (WER) and keyword error rate (KER) in speech transcripts and their effect on the performance of speech analytics applications. Automatic speech recognition (ASR) systems are increasingly used as input for speech analytics, which raises the question of whether WER or KER is the more suitable performance metric for calibrating the ASR system. ASR systems are typically evaluated in terms of WER. Many speech analytics applications, however, rely on identifying keywords in the transcripts - thus their performance can be expected to be more sensitive to keyword errors than regular word errors.
To study this question, we conduct a case study using an experimental data set comprising 100 calls to a contact center. We first automatically extract domain-specific words from the manual transcription and use this set of words to calculate keyword error rates in the following experiments. We then generate call transcripts with the IBM Attila speech recognition system, using different training for each repetition to generate transcripts with a range of word error rates. The transcripts are then processed with two speech analytics applications, call section segmentation and topic categorization. The results show similar WER and KER in high-accuracy transcripts, but KER increases more rapidly than WER as the accuracy of the transcription deteriorates. Neither speech analytics application showed significant sensitivity to the increase in KER for low-accuracy transcripts. Thus this case study did not identify a significant difference between using WER and KER.
Bibliographic reference. Park, Youngja / Patwardhan, Siddharth / Visweswariah, Karthik / Gates, Stephen C. (2008): "An empirical analysis of word error rate and keyword error rate", In INTERSPEECH-2008, 2070-2073.