14th Annual Conference of the International Speech Communication Association

Lyon, France
August 25-29, 2013

Automatic Human Utility Evaluation of ASR Systems: Does WER Really Predict Performance?

Benoit Favre (1), Kyla Cheung (2), Siavash Kazemian (3), Adam Lee (4), Yang Liu (5), Cosmin Munteanu (3), Ani Nenkova (6), Dennis Ochei (7), Gerald Penn (3), Stephen Tratz (8), Clare Voss (8), Frauke Zeller (9)

(1) Université d'Aix-Marseille, France
(2) Columbia University, USA
(3) University of Toronto, Canada
(5) University of Texas at Dallas, USA
(6) University of Pennsylvania, USA
(7) Duke University, USA
(8) ARL, USA
(9) University College London, UK

We propose an alternative evaluation metric to Word Error Rate (WER) for the decision audit task of meeting recordings, which exemplifies how to evaluate speech recognition within a legitimate application context. Using machine learning on an initial seed of human-subject experimental data, our alternative metric handily outperforms WER, which correlates very poorly with human subjects' success in finding decisions given ASR transcripts with a range of WERs.

Bibliographic reference. Favre, Benoit / Cheung, Kyla / Kazemian, Siavash / Lee, Adam / Liu, Yang / Munteanu, Cosmin / Nenkova, Ani / Ochei, Dennis / Penn, Gerald / Tratz, Stephen / Voss, Clare / Zeller, Frauke (2013): "Automatic human utility evaluation of ASR systems: does WER really predict performance?", in INTERSPEECH-2013, 3463-3467.