ISCA Archive Interspeech 2021

Revisiting Parity of Human vs. Machine Conversational Speech Transcription

Courtney Mansfield, Sara Ng, Gina-Anne Levow, Richard A. Wright, Mari Ostendorf

A number of studies have compared human and machine transcription, showing that automatic speech recognition (ASR) is approaching human performance in some contexts. Most studies look at differences as measured by the standard speech recognition scoring criterion: word error rate (WER). This study presents a finer-grained analysis of differences for conversational speech data where systems have reached human parity in terms of average WER, examining insertions vs. deletions, word category, and word context as characterized by linguistic surprisal. In contrast to ASR systems, humans are more likely to miss words than to misrecognize them, and they are much more likely to make errors in transcribing words associated primarily with conversational contexts (fillers, backchannels, and discourse cue words). The differences are more pronounced for more informal contexts, i.e., conversations between family members. Although human transcribers may miss these words, conversational partners seem to use them in turn-taking and processing disfluencies. Thus, ASR systems may need superhuman transcription performance for spoken language technology to achieve human-level conversation skills.
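For reference, WER aligns the hypothesis transcript to the reference with minimum edit distance and reports (substitutions + insertions + deletions) / reference length; the abstract's contrast between human deletions and machine misrecognitions rests on this breakdown. Below is a minimal sketch of such a scorer (not the paper's actual scoring tool; the function name and tie-breaking in the backtrace are illustrative choices):

```python
def wer_breakdown(ref, hyp):
    """Align reference and hypothesis word lists by minimum edit distance.

    Returns (wer, substitutions, insertions, deletions).
    Illustrative sketch; production scoring uses tools such as NIST sclite.
    """
    r, h = len(ref), len(hyp)
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (h + 1) for _ in range(r + 1)]
    for i in range(r + 1):
        d[i][0] = i
    for j in range(h + 1):
        d[0][j] = j
    for i in range(1, r + 1):
        for j in range(1, h + 1):
            diag = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(diag, d[i - 1][j] + 1, d[i][j - 1] + 1)
    # Backtrace to classify each error (ties broken toward match/substitution).
    subs = ins = dels = 0
    i, j = r, h
    while i > 0 or j > 0:
        if (i > 0 and j > 0
                and d[i][j] == d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])):
            subs += ref[i - 1] != hyp[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            dels += 1  # reference word missing from hypothesis
            i -= 1
        else:
            ins += 1  # hypothesis word with no reference counterpart
            j -= 1
    return (subs + ins + dels) / max(r, 1), subs, ins, dels
```

Dropping a filler, as human transcribers often do, surfaces as a deletion: scoring reference "uh i mean it was fine" against hypothesis "i mean it was fine" yields one deletion and a WER of 1/6.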

doi: 10.21437/Interspeech.2021-1908

Cite as: Mansfield, C., Ng, S., Levow, G.-A., Wright, R.A., Ostendorf, M. (2021) Revisiting Parity of Human vs. Machine Conversational Speech Transcription. Proc. Interspeech 2021, 1997-2001, doi: 10.21437/Interspeech.2021-1908

@inproceedings{mansfield21_interspeech,
  author={Courtney Mansfield and Sara Ng and Gina-Anne Levow and Richard A. Wright and Mari Ostendorf},
  title={{Revisiting Parity of Human vs. Machine Conversational Speech Transcription}},
  year=2021,
  booktitle={Proc. Interspeech 2021},
  pages={1997--2001},
  doi={10.21437/Interspeech.2021-1908}
}