Benchmarking Benchmarks: Introducing New Automatic Indicators for Benchmarking Spoken Language Understanding Corpora

Frédéric Béchet, Christian Raymond


Empirical evaluation is nowadays the main evaluation paradigm in Natural Language Processing for assessing the relevance of a new machine-learning based model. While large corpora are available for tasks such as Automatic Speech Recognition, this is not the case for other tasks such as Spoken Language Understanding (SLU), which consists of translating spoken transcriptions into a formal representation, often based on semantic frames. Corpora such as ATIS or SNIPS are widely used to compare systems; however, differences in performance among systems are often very small, not statistically significant, and can be produced by biases in the data collection or the annotation scheme, as we showed on the ATIS corpus ("Is ATIS Too Shallow?", Interspeech 2018). We propose in this study a new methodology for assessing the relevance of an SLU corpus. We claim that taking into account only system performance does not provide enough insight into what is covered by current state-of-the-art models and what is left to be done. We apply our methodology to a set of 4 SLU systems and 5 benchmark corpora (ATIS, SNIPS, M2M, MEDIA) and automatically produce several indicators assessing the relevance (or not) of each corpus for benchmarking SLU models.
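The abstract observes that score gaps between SLU systems are often not statistically significant. The paper does not specify a particular test, but a common way to check such a claim is a paired bootstrap over per-utterance correctness; the sketch below is purely illustrative, and the function name and toy data are hypothetical:

```python
import random

def paired_bootstrap(sys_a, sys_b, n_boot=1000, seed=0):
    """Paired bootstrap significance test (illustrative sketch).

    sys_a, sys_b: parallel lists of 0/1 per-utterance correctness
    for two systems on the same test set. Returns the observed
    accuracy gap and an approximate p-value: the fraction of
    resampled test sets on which the gap disappears or reverses.
    """
    rng = random.Random(seed)
    n = len(sys_a)
    delta = (sum(sys_a) - sum(sys_b)) / n  # observed accuracy gap
    wins = 0
    for _ in range(n_boot):
        # resample test utterances with replacement, keeping pairs aligned
        idx = [rng.randrange(n) for _ in range(n)]
        d = sum(sys_a[i] - sys_b[i] for i in idx) / n
        if d > 0:
            wins += 1
    return delta, 1.0 - wins / n_boot

# Toy example: system A is correct on 60/100 utterances, B on 55/100.
sys_a = [1] * 60 + [0] * 40
sys_b = [1] * 55 + [0] * 45
delta, p = paired_bootstrap(sys_a, sys_b)
```

With a small gap like this 5-point toy difference, the returned p-value is typically well above conventional thresholds, which is exactly the situation the abstract warns about when ranking systems on small benchmark corpora.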


DOI: 10.21437/Interspeech.2019-3033

Cite as: Béchet, F., Raymond, C. (2019) Benchmarking Benchmarks: Introducing New Automatic Indicators for Benchmarking Spoken Language Understanding Corpora. Proc. Interspeech 2019, 4145-4149, DOI: 10.21437/Interspeech.2019-3033.


@inproceedings{Béchet2019,
  author={Frédéric Béchet and Christian Raymond},
  title={{Benchmarking Benchmarks: Introducing New Automatic Indicators for Benchmarking Spoken Language Understanding Corpora}},
  year=2019,
  booktitle={Proc. Interspeech 2019},
  pages={4145--4149},
  doi={10.21437/Interspeech.2019-3033},
  url={http://dx.doi.org/10.21437/Interspeech.2019-3033}
}