The corpus is an invaluable resource in Spoken and Natural Language Processing. Consistent data sets have allowed for empirical evaluation of competing algorithms. The sharing of high-quality annotated linguistic data has enabled participation and experimentation by a wide range of researchers. However, although their annotations are dubbed "gold-standard", many corpora contain labeling errors and idiosyncrasies. The current view of the corpus as a static resource makes correction of errors and other modifications prohibitively difficult. In this paper, we advance a perspective of the corpus as dynamically changing. Version control software can provide a mechanism to support such a view. We highlight the problems of the static view of the corpus through case studies of the Penn Treebank, Switchboard, Hub-4, and Boston University Radio News Corpus.
Index Terms: Linguistic Resources, Opinion paper
Bibliographic reference. Rosenberg, Andrew (2012): "Rethinking the corpus: moving towards dynamic linguistic resources", In INTERSPEECH-2012, 1392-1395.