Comparisons of machine and human speech recognition are widely believed to help determine both the room for improvement of speech recognizers and the direction it should take. Yet such experiments are performed quite rarely, or over domains so complex that instructive conclusions are hard to draw. In this paper we attempt to measure human performance separately on the tasks of the acoustic and language models of ASR systems. To simulate the task of acoustic decoding, subjects were instructed to phonetically transcribe short nonsense sentences. Here, besides the well-known superior segment classification, we also observed good performance in word segmentation. To imitate higher-level processing, the subjects had to correct deliberately corrupted texts. Here we found that humans can achieve a word accuracy of about 80% even when almost one third of the phonemes are incorrect, and that with word-boundary position information the word error rate roughly halves.
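The word accuracy and word error rate (WER) figures cited above are standard edit-distance measures over word tokens. As an illustration only (this code is not from the paper), a minimal sketch of the usual WER computation via Levenshtein distance:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance divided by
    the number of reference words. Word accuracy is 1 - WER."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i          # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j          # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = dp[i - 1][j] + 1
            insertion = dp[i][j - 1] + 1
            dp[i][j] = min(substitution, deletion, insertion)
    return dp[len(ref)][len(hyp)] / len(ref)
```

For example, a hypothesis with one substitution among four reference words gives a WER of 0.25, i.e. a word accuracy of 75%.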
Bibliographic reference. Tóth, László (2007): "Benchmarking human performance on the acoustic and linguistic subtasks of ASR systems", In INTERSPEECH-2007, 382-385.