We propose an alternative evaluation metric to Word Error Rate (WER) for the decision audit task of meeting recordings, which exemplifies how to evaluate speech recognition within a legitimate application context. Using machine learning on an initial seed of human-subject experimental data, our alternative metric handily outperforms WER, which correlates very poorly with human subjects' success at finding decisions in ASR transcripts spanning a range of WERs.
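For readers unfamiliar with the baseline metric being critiqued, WER is the word-level edit distance between an ASR hypothesis and a reference transcript, normalized by reference length. A minimal sketch (not the paper's proposed metric, just the standard WER definition):

```python
# Word Error Rate: (substitutions + deletions + insertions) / reference length,
# computed via word-level Levenshtein distance.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming table: d[i][j] = edit distance between
    # the first i reference words and the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

# One deleted word out of five reference words -> WER = 0.2
print(wer("the meeting reached a decision", "the meeting reached decision"))
```

Note that WER treats every word equally, which is precisely why it can diverge from task success: an error on a content word central to a decision costs the same as an error on a filler word.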
Bibliographic reference. Favre, Benoit / Cheung, Kyla / Kazemian, Siavash / Lee, Adam / Liu, Yang / Munteanu, Cosmin / Nenkova, Ani / Ochei, Dennis / Penn, Gerald / Tratz, Stephen / Voss, Clare / Zeller, Frauke (2013): "Automatic human utility evaluation of ASR systems: does WER really predict performance?", In INTERSPEECH-2013, 3463-3467.