Language models (LMs) have improved dramatically in recent years thanks to the widespread adoption of neural networks. This raises the question of how far we are from a perfect language model and how much more research is needed in language modelling. For perplexity, giving a value for human perplexity (as an upper bound of what is reasonably expected from an LM) is difficult. Word error rate (WER) has the disadvantage that it also measures the quality of other components of a speech recognizer, such as the acoustic model and the feature extraction. We therefore suggest evaluating LMs in a generative setting (which has previously been done only on hand-picked examples) and running a human evaluation on the generated sentences. The results imply that LMs need about 10 to 20 more years of research before human performance is reached. Moreover, we show that the human judgement scores on the generated sentences are closely correlated with perplexity. This leads to an estimated perplexity of 12 for an LM that would be able to pass the human judgement test in the suggested setting.
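To make the final extrapolation step concrete, the following is a minimal sketch, assuming one fits a linear relation between log-perplexity and mean human judgement score across several LMs and then solves for the perplexity at which the fitted score reaches a human reference score. The function name, variable names, and the choice of log-perplexity as the regressor are assumptions for illustration, not the paper's actual procedure or data.

    import numpy as np

    def estimate_target_perplexity(perplexities, human_scores, human_reference_score):
        """Extrapolate the perplexity at which an LM would match human judgement.

        perplexities          -- per-model test-set perplexities (illustrative inputs)
        human_scores          -- per-model mean human judgement scores on generated sentences
        human_reference_score -- mean score assigned to human-written sentences
        """
        x = np.log(np.asarray(perplexities, dtype=float))
        y = np.asarray(human_scores, dtype=float)

        # Least-squares fit: score ~ slope * log(perplexity) + intercept
        slope, intercept = np.polyfit(x, y, deg=1)

        # Solve slope * log(ppl) + intercept = human_reference_score for ppl
        target_log_ppl = (human_reference_score - intercept) / slope
        return float(np.exp(target_log_ppl))

Feeding in the paper's measured (perplexity, human score) pairs and the human reference score would, under this assumed linear-in-log model, yield a target perplexity of the kind reported in the abstract (around 12).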
Cite as: Shen, X., Oualil, Y., Greenberg, C., Singh, M., Klakow, D. (2017) Estimation of Gap Between Current Language Models and Human Performance. Proc. Interspeech 2017, 553-557, doi: 10.21437/Interspeech.2017-729
@inproceedings{shen17_interspeech,
  author={Xiaoyu Shen and Youssef Oualil and Clayton Greenberg and Mittul Singh and Dietrich Klakow},
  title={{Estimation of Gap Between Current Language Models and Human Performance}},
  year={2017},
  booktitle={Proc. Interspeech 2017},
  pages={553--557},
  doi={10.21437/Interspeech.2017-729}
}