ISCA Archive Interspeech 2008

Weakly supervised training for parsing Mandarin broadcast transcripts

Wen Wang

We present a systematic investigation of weakly supervised co-training approaches for improving the parsing of Mandarin broadcast news (BN) and broadcast conversation (BC) transcripts. Two competitive Chinese parsers are iteratively retrained from a small set of treebanked data and a large set of unlabeled data. We compare co-training to self-training. Our results show that co-training performs significantly better than self-training, and that both approaches, given a small seed labeled corpus, improve parsing accuracy significantly over training on the mismatched newswire treebank. We also investigate a variety of example selection approaches for co-training and find that our proposed approach, based on maximizing training utility, produces the best parsing accuracy. Finally, we investigate issues specific to Chinese parsing, including character-based parsing and word segmentation for parsing.
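
The general shape of a co-training loop can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: `ViewLearner` is a hypothetical toy classifier standing in for a parser, `co_train` and its parameters (`rounds`, `per_round`, `min_conf`) are invented for this example, and examples are selected by a plain confidence threshold rather than by the training-utility criterion the abstract mentions.

```python
from collections import Counter, defaultdict


class ViewLearner:
    """Hypothetical toy stand-in for a parser: predicts a label from one feature."""

    def __init__(self, view):
        self.view = view  # index of the feature this learner looks at
        self.counts = defaultdict(Counter)

    def train(self, labeled):
        # Retrain from scratch on (features, label) pairs.
        self.counts = defaultdict(Counter)
        for x, y in labeled:
            self.counts[x[self.view]][y] += 1

    def predict(self, x):
        # Return (label, confidence); (None, 0.0) for an unseen feature value.
        c = self.counts.get(x[self.view])
        if not c:
            return None, 0.0
        label, n = c.most_common(1)[0]
        return label, n / sum(c.values())


def co_train(a, b, seed, unlabeled, rounds=5, per_round=2, min_conf=0.9):
    """Each learner labels unlabeled examples to extend the OTHER learner's data."""
    train_a, train_b = list(seed), list(seed)
    pool = list(unlabeled)
    for _ in range(rounds):
        a.train(train_a)
        b.train(train_b)
        if not pool:
            break
        picked = set()
        for teacher, student_train in ((a, train_b), (b, train_a)):
            scored = []
            for i, x in enumerate(pool):
                label, conf = teacher.predict(x)
                if label is not None and conf >= min_conf:
                    scored.append((conf, i, label))
            # Assumption: keep the most confident predictions first.
            scored.sort(key=lambda t: (-t[0], t[1]))
            for conf, i, label in scored[:per_round]:
                student_train.append((pool[i], label))
                picked.add(i)
        if not picked:
            break  # neither learner could label anything confidently
        pool = [x for i, x in enumerate(pool) if i not in picked]
    # Final retraining so both learners reflect the last additions.
    a.train(train_a)
    b.train(train_b)
    return a, b
```

Self-training would be the single-learner variant of the same loop, in which a learner adds its own confident predictions to its own training set instead of handing them to a second learner.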

doi: 10.21437/Interspeech.2008-607

Cite as: Wang, W. (2008) Weakly supervised training for parsing Mandarin broadcast transcripts. Proc. Interspeech 2008, 2446-2449, doi: 10.21437/Interspeech.2008-607

  author={Wen Wang},
  title={{Weakly supervised training for parsing Mandarin broadcast transcripts}},
  booktitle={Proc. Interspeech 2008},