We present a systematic investigation of weakly supervised co-training for parsing Mandarin broadcast news (BN) and broadcast conversation (BC) transcripts, in which two competitive Chinese parsers are iteratively retrained from a small set of treebanked data and a large set of unlabeled data. We compare co-training to self-training. Our results show that co-training performs significantly better than self-training, and that both, given a small seed labeled corpus, significantly improve parsing accuracy over training on the mismatched newswire treebank. We also investigate a variety of example selection approaches for co-training and find that our proposed approach, based on maximizing training utility, produces the best parsing accuracy. Finally, we examine issues specific to Chinese parsing, including character-based parsing and word segmentation for parsing.
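The iterative retraining loop described above can be illustrated with a minimal sketch. It is not the paper's implementation: the two parsers are replaced by hypothetical toy classifiers (`CentroidLearner`), each looking at one feature "view", and example selection is a plain confidence threshold rather than the training-utility criterion the abstract proposes. The names `co_train`, `rounds`, and `threshold` are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class CentroidLearner:
    """Toy stand-in for a parser: a 1-D nearest-centroid classifier on one view."""
    view: int
    centroids: dict = field(default_factory=dict)

    def train(self, labeled):
        # Average the chosen feature per label.
        sums = {}
        for features, label in labeled:
            total, count = sums.get(label, (0.0, 0))
            sums[label] = (total + features[self.view], count + 1)
        self.centroids = {lab: total / count for lab, (total, count) in sums.items()}

    def predict(self, features):
        # Confidence is the distance gap between the two nearest centroids.
        dists = sorted((abs(features[self.view] - c), lab)
                       for lab, c in self.centroids.items())
        best_dist, best_label = dists[0]
        margin = dists[1][0] - best_dist if len(dists) > 1 else 0.0
        return best_label, margin


def co_train(learner_a, learner_b, seed, unlabeled, rounds=3, threshold=1.0):
    """Each learner labels unlabeled examples for the OTHER learner's training set."""
    train_a, train_b = list(seed), list(seed)
    pool = list(unlabeled)
    for _ in range(rounds):
        learner_a.train(train_a)
        learner_b.train(train_b)
        remaining = []
        for x in pool:
            label_a, conf_a = learner_a.predict(x)
            label_b, conf_b = learner_b.predict(x)
            used = False
            if conf_a >= threshold:   # A is confident: A teaches B
                train_b.append((x, label_a))
                used = True
            if conf_b >= threshold:   # B is confident: B teaches A
                train_a.append((x, label_b))
                used = True
            if not used:
                remaining.append(x)   # keep for a later round
        pool = remaining
    # Final retraining on the seed data plus all automatically labeled examples.
    learner_a.train(train_a)
    learner_b.train(train_b)
    return learner_a, learner_b, pool
```

Self-training, the baseline the abstract compares against, would differ only in that each learner adds its own confident labels to its own training set instead of the other learner's.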
Bibliographic reference. Wang, Wen (2008): "Weakly supervised training for parsing Mandarin broadcast transcripts", In INTERSPEECH-2008, 2446-2449.