Numerous empirical results have shown that combining data from multiple domains often improves statistical machine translation (SMT) performance. For example, if we wish to build an SMT system for the medical domain, it may be beneficial to augment the training data with bitext from another domain, such as parliamentary proceedings. Despite these positive results, it is not clear exactly how and where additional out-of-domain data helps in the SMT training pipeline. In this work, we analyze this problem in detail, considering the following hypotheses: out-of-domain data helps by either (a) improving word alignment or (b) improving phrase coverage. Using a multitude of datasets (IWSLT-TED, EMEA, Europarl, OpenSubtitles, KDE), we show that out-of-domain data may sometimes help word alignment more than it helps phrase coverage, and that a more flexible combination of data along different parts of the training pipeline may lead to better results.
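To make the two hypotheses concrete, the following is a minimal conceptual sketch (not the authors' code) of a phrase-based training pipeline. The toy `word_align` and `extract_phrases` functions are hypothetical stand-ins for real tools such as GIZA++ and a phrase extractor; the sketch only illustrates that out-of-domain bitext can be fed to the alignment step alone (hypothesis a) or to both alignment and phrase extraction (hypothesis b).

```python
# Conceptual sketch: two ways of using out-of-domain bitext in a
# phrase-based SMT training pipeline. The aligner and phrase extractor
# below are toy stand-ins, not real tools.

def word_align(bitext):
    """Toy word aligner: aligns each source token to the target token
    in the same position (a stand-in for IBM-model alignment)."""
    alignments = []
    for src, tgt in bitext:
        n = min(len(src.split()), len(tgt.split()))
        alignments.append([(i, i) for i in range(n)])
    return alignments

def extract_phrases(bitext, alignments, max_len=3):
    """Toy phrase extraction: collects short contiguous phrase pairs
    anchored at aligned word positions."""
    phrases = set()
    for (src, tgt), align in zip(bitext, alignments):
        s_toks, t_toks = src.split(), tgt.split()
        for i, j in align:
            for length in range(1, max_len + 1):
                if i + length <= len(s_toks) and j + length <= len(t_toks):
                    phrases.add((" ".join(s_toks[i:i + length]),
                                 " ".join(t_toks[j:j + length])))
    return phrases

in_domain = [("la dosis diaria", "the daily dose")]       # e.g. medical bitext
out_domain = [("el parlamento europeo", "the european parliament")]  # e.g. Europarl

# (a) out-of-domain data used only to improve word alignment:
#     align on the concatenation, but extract phrases from in-domain bitext only.
align_all = word_align(in_domain + out_domain)
table_a = extract_phrases(in_domain, align_all[:len(in_domain)])

# (b) out-of-domain data also used to improve phrase coverage:
#     both alignment and phrase extraction see the concatenated data.
table_b = extract_phrases(in_domain + out_domain, align_all)

print(len(table_a), "in-domain phrase pairs vs.", len(table_b), "with added coverage")
```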
Cite as: Duh, K., Sudoh, K., Tsukada, H. (2010) Analysis of translation model adaptation in statistical machine translation. Proc. International Workshop on Spoken Language Translation (IWSLT 2010), 243-250
@inproceedings{duh10_iwslt,
  author    = {Kevin Duh and Katsuhito Sudoh and Hajime Tsukada},
  title     = {{Analysis of translation model adaptation in statistical machine translation}},
  year      = {2010},
  booktitle = {Proc. International Workshop on Spoken Language Translation (IWSLT 2010)},
  pages     = {243--250}
}