9th Annual Conference of the International Speech Communication Association

Brisbane, Australia
September 22-26, 2008

Weakly Supervised Training for Parsing Mandarin Broadcast Transcripts

Wen Wang

SRI International, USA

We present a systematic investigation of weakly supervised co-training for improving parsing performance on Mandarin broadcast news (BN) and broadcast conversation (BC) transcripts. The approach iteratively retrains two competitive Chinese parsers from a small set of treebanked data and a large set of unlabeled data. We compare co-training to self-training. Our results show that co-training performs significantly better than self-training, and that both methods, given a small labeled seed corpus, significantly improve parsing accuracy over training on the mismatched newswire treebank. We also investigate a variety of example selection approaches for co-training and find that our proposed approach, based on maximizing training utility, produces the best parsing accuracy. Finally, we investigate issues specific to Chinese parsing, including character-based parsing and word segmentation for parsing.
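
The general shape of a co-training loop can be illustrated with a minimal sketch. This is not the paper's implementation. `ToyLearner`, `confident`, and `co_train` are hypothetical names, the learners are simple vote counters standing in for the two parsers, a string label stands in for a parse tree, and examples are selected by plain confidence ranking rather than by the training-utility criterion proposed in the paper.

```python
from collections import Counter, defaultdict
from typing import Callable, Hashable, List, Tuple

Example = Tuple[str, str]  # (sentence, label); the label stands in for a parse tree


class ToyLearner:
    """Hypothetical stand-in for a parser: votes over one 'view' of the input."""

    def __init__(self, view: Callable[[str], Hashable]):
        self.view = view
        self.counts = defaultdict(Counter)

    def fit(self, data: List[Example]) -> None:
        # Retrain from scratch on the current training set.
        self.counts = defaultdict(Counter)
        for x, y in data:
            self.counts[self.view(x)][y] += 1

    def predict(self, x: str) -> Tuple[str, float]:
        # Return (label, confidence); confidence 0.0 means "no opinion".
        votes = self.counts.get(self.view(x))
        if not votes:
            return "", 0.0
        label, n = votes.most_common(1)[0]
        return label, n / sum(votes.values())


def confident(learner: ToyLearner, pool: List[str], top_k: int) -> List[Example]:
    """Label the pool and keep the top_k most confident outputs (assumed selection rule)."""
    scored = [(x,) + learner.predict(x) for x in pool]
    scored = [s for s in scored if s[2] > 0.0]
    scored.sort(key=lambda s: s[2], reverse=True)
    return [(x, y) for x, y, _ in scored[:top_k]]


def co_train(a: ToyLearner, b: ToyLearner, seed: List[Example],
             unlabeled: List[str], rounds: int = 3, top_k: int = 2):
    train_a, train_b = list(seed), list(seed)
    pool = list(unlabeled)
    for _ in range(rounds):
        a.fit(train_a)
        b.fit(train_b)
        from_a = confident(a, pool, top_k)
        from_b = confident(b, pool, top_k)
        used = {x for x, _ in from_a + from_b}
        if not used:
            break
        # Each learner's confident output becomes training data for the other.
        train_b += from_a
        train_a += from_b
        pool = [x for x in pool if x not in used]
    a.fit(train_a)
    b.fit(train_b)
    return a, b


# Toy usage: learner A sees the first word, learner B sees the last word.
seed = [("red apple", "fruit"), ("blue car", "vehicle")]
unlabeled = ["red cherry", "green car", "green bus"]
learner_a, learner_b = co_train(
    ToyLearner(lambda s: s.split()[0]),
    ToyLearner(lambda s: s.split()[-1]),
    seed, unlabeled,
)
```

In this sketch, self-training would correspond to feeding each learner's confident output back into its own training set instead of the other learner's.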


Bibliographic reference.  Wang, Wen (2008): "Weakly supervised training for parsing Mandarin broadcast transcripts", In INTERSPEECH-2008, 2446-2449.