ISCA Archive Interspeech 2008

Parsing with subdomain instance weighting from raw corpora

Barbara Plank, Khalil Sima'an

The treebanks used for training statistical parsers consist of hand-parsed sentences from a single source or domain, such as newspaper text. However, newspaper text covers different subdomains of language use (e.g. finance, sports, politics, music), which implies that the statistics gathered by generative statistical parsers are averages over subdomain statistics. In this paper we explore a method, subdomain instance weighting, that exploits raw subdomain corpora to introduce subdomain statistics into a state-of-the-art generative parser. We employ instance weighting to create an ensemble of subdomain-specific versions of the parser, and explore methods for combining their predictions. Our experiments show that subdomain statistics extracted from raw corpora can improve the quality of the n-best lists of a strong, state-of-the-art parser.
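
As a rough illustration of the general idea of instance weighting (not the paper's actual estimator or parser), the sketch below re-estimates grammar rule probabilities by relative frequency, with each training tree's rule counts scaled by a per-tree weight. The function name `weighted_rule_probs`, the representation of a tree as a list of `(lhs, rhs)` rules, and the example weights are all assumptions made for this example; how the weights are derived from raw subdomain corpora is not specified in the abstract and is not modeled here.

```python
from collections import defaultdict


def weighted_rule_probs(trees, weights):
    """Estimate P(rhs | lhs) by weighted relative frequency.

    trees   -- list of trees, each given as a list of (lhs, rhs) rules
               (a hypothetical, simplified treebank representation)
    weights -- one non-negative weight per tree, e.g. reflecting how
               typical the sentence is of a given subdomain (assumed)
    """
    rule_counts = defaultdict(float)
    lhs_totals = defaultdict(float)
    for rules, weight in zip(trees, weights):
        for lhs, rhs in rules:
            # Each rule occurrence contributes its tree's weight
            # instead of a plain count of 1.
            rule_counts[(lhs, rhs)] += weight
            lhs_totals[lhs] += weight
    return {
        (lhs, rhs): count / lhs_totals[lhs]
        for (lhs, rhs), count in rule_counts.items()
    }


# Toy treebank: two trees, each reduced to a single NP rule.
treebank = [
    [("NP", ("DT", "NN"))],
    [("NP", ("NNP",))],
]

# With uniform weights both rules are equally likely; up-weighting the
# first tree shifts probability mass toward its rule.
uniform = weighted_rule_probs(treebank, [1.0, 1.0])
skewed = weighted_rule_probs(treebank, [3.0, 1.0])
```

Under this sketch, one such weighted estimate per subdomain would give one subdomain-specific model each; combining the models' predictions is a separate step that is not shown.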

Cite as: Plank, B., Sima'an, K. (2008) Parsing with subdomain instance weighting from raw corpora. Proc. Interspeech 2008, 2540

  author={Barbara Plank and Khalil Sima'an},
  title={{Parsing with subdomain instance weighting from raw corpora}},
  booktitle={Proc. Interspeech 2008},