The performance of phrase-based SMT systems is crucially dependent on the quality of the extracted phrase pairs, which is in turn a function of word alignment quality. Data sparsity, an inherent problem in SMT even with large training corpora, has an adverse impact on the reliability of the extracted phrase translation pairs. We present a novel feature based on bootstrap resampling of the training corpus, termed phrase alignment confidence, that measures the goodness of a phrase translation pair. We integrate this feature within a phrase-based SMT system and show an improvement of 1.7% BLEU and 4.4% METEOR over a baseline English-to-Pashto (E2P) SMT system that does not use any measure of phrase pair quality. We then show that the proposed measure compares well to an existing indicator of phrase pair reliability, the lexical smoothing probability. We also demonstrate that combining the two measures leads to a further improvement of 0.4% BLEU and 0.3% METEOR on the E2P system.
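The bootstrap idea behind such a feature can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the exact definition, so the sketch assumes the confidence of a phrase pair is the fraction of resampled corpora from which that pair is extracted. The names `phrase_alignment_confidence`, `toy_extract`, and `num_resamples` are hypothetical, and the toy extractor merely stands in for a real word alignment and phrase extraction pipeline.

```python
import random
from collections import Counter


def toy_extract(corpus):
    """Hypothetical stand-in for word alignment plus phrase extraction.

    Treats every (source, target) sentence pair as a single phrase pair.
    A real system would realign each resampled corpus and extract
    phrase pairs from the resulting word alignments.
    """
    return {(src, tgt) for src, tgt in corpus}


def phrase_alignment_confidence(corpus, extract, num_resamples=10, seed=0):
    """Assumed definition: fraction of bootstrap resamples of the training
    corpus in which a given phrase pair is extracted."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(num_resamples):
        # Bootstrap resample: draw len(corpus) sentence pairs with replacement.
        sample = [rng.choice(corpus) for _ in corpus]
        # Each phrase pair is counted at most once per resample.
        counts.update(extract(sample))
    return {pair: n / num_resamples for pair, n in counts.items()}


if __name__ == "__main__":
    corpus = [("a b", "x y"), ("a b", "x y"), ("c", "z"), ("d e", "w")]
    for pair, conf in sorted(phrase_alignment_confidence(corpus, toy_extract).items()):
        print(pair, conf)
```

Under this assumed definition, phrase pairs supported by many training sentences survive most resamples and score near 1, while pairs that depend on a single sentence pair score lower. The resulting score could then be added as one more feature of each phrase pair in the translation model.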
Bibliographic reference. Ananthakrishnan, Sankaranarayanan / Prasad, Rohit / Natarajan, Prem (2010): "Phrase alignment confidence for statistical machine translation", In INTERSPEECH-2010, 2878-2881.