11th Annual Conference of the International Speech Communication Association

Makuhari, Chiba, Japan
September 26-30, 2010

Creating a Linguistic Plausibility Dataset with Non-Expert Annotators

Benjamin Lambert, Rita Singh, Bhiksha Raj

Carnegie Mellon University, USA

We describe the creation of a linguistic plausibility dataset that contains annotated examples of language judged to be linguistically plausible, implausible, and everything in between. To create the dataset, we randomly generate sentences and have them annotated through crowdsourcing on Amazon Mechanical Turk. Obtaining inter-annotator agreement is difficult because linguistic plausibility is highly subjective. The annotations obtained depend, among other factors, on the manner in which annotators are questioned about the plausibility of sentences. We describe our experiments on posing a number of different questions to the annotators to elicit the responses with the greatest agreement, and present several methods for analyzing the resulting responses. The generated dataset and annotations are being made available to the public.

Bibliographic reference. Lambert, Benjamin / Singh, Rita / Raj, Bhiksha (2010): "Creating a linguistic plausibility dataset with non-expert annotators", In INTERSPEECH-2010, 1906-1909.