4th International Conference on Spoken Language Processing
Philadelphia, PA, USA
The gap between human and machine performance on speech recognition tasks is still very large. Word recognition accuracy on telephone conversations is only slightly better than 50%, based on results reported on the Switchboard corpus by leading researchers using state-of-the-art HMM systems. We know from our own experience that human perception typically delivers much more accurate word recognition over the telephone. Why is there such a large gap between machine and human performance, and what can be done to close it? One way to address this question is to study the sources of linguistic information in the speech signal that are known to be important for word recognition, and to measure how well machine systems utilize this information relative to humans. As an initial step in this direction, we measured the word recognition performance of listeners presented with words from the Switchboard corpus. Stimuli consisted of actual utterances excised from the Switchboard corpus, high-quality recordings of utterances that occurred in Switchboard conversations, and recordings of word sequences with zero, medium, and high bigram probabilities based on a language model computed from transcriptions of the Switchboard corpus. The results show that human listeners are very good at recognizing words in the absence of word sequence constraints, and that statistical language models fail to capture much of the high-level linguistic information needed to recognize words in fluent speech. The results are discussed in terms of their implications for current approaches to acoustic and language modeling in computer speech recognition.
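As an illustration of the bigram probabilities referred to in the abstract, the following is a minimal sketch of maximum-likelihood bigram estimation from tokenized transcripts. The abstract does not say how the language model was estimated, whether it was smoothed, or where the boundaries between zero, medium, and high probability were drawn, so the function names, the `<s>` and `</s>` boundary markers, and the unsmoothed estimate are all assumptions made for illustration only.

```python
from collections import Counter


def train_bigram_model(sentences):
    """Count bigrams and their left-context words over tokenized sentences.

    Each sentence is a list of word strings. The boundary markers <s> and
    </s> are an illustrative convention, not something taken from the paper.
    """
    history_counts = Counter()
    bigram_counts = Counter()
    for words in sentences:
        tokens = ["<s>"] + list(words) + ["</s>"]
        for prev, cur in zip(tokens, tokens[1:]):
            history_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    return history_counts, bigram_counts


def bigram_prob(prev, cur, history_counts, bigram_counts):
    """Unsmoothed maximum-likelihood estimate of P(cur | prev).

    Returns 0.0 for an unseen bigram or an unseen left-context word.
    """
    if history_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, cur)] / history_counts[prev]


def sequence_bigram_probs(words, history_counts, bigram_counts):
    """Bigram probability of each adjacent word pair in a word sequence."""
    return [
        bigram_prob(prev, cur, history_counts, bigram_counts)
        for prev, cur in zip(words, words[1:])
    ]
```

Under this sketch, a word sequence containing only pairs that never occur in the training transcripts would receive bigram probabilities of zero, while sequences made of frequently co-occurring pairs would receive higher values. Any cutoff separating "medium" from "high" would have to be chosen separately.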
Bibliographic reference. Cole, Ronald A. / Yan, Yonghong / Bailey, Troy (1996): "The influence of bigram constraints on word recognition by humans: implications for computer speech recognition", In ICSLP-1996, 829-832.