International Workshop on Spoken Language Translation (IWSLT) 2011
San Francisco, CA, USA
In this paper, we investigate lexicon models for hierarchical
phrase-based statistical machine translation. We study
five types of lexicon models: a model which is extracted
from word-aligned training data and, given the word alignment
matrix, relies on pure relative frequencies; the
IBM model 1 lexicon; a regularized version of IBM
model 1; a triplet lexicon model variant; and a discriminatively
trained word lexicon model. We explore source-to-target
models with phrase-level as well as sentence-level
scoring, and target-to-source models with scoring at the phrase
level only. For the first two types of lexicon models, we compare
several scoring variants. All models are used during
search, i.e. they are incorporated directly into the log-linear
model combination of the decoder.
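The log-linear combination mentioned above follows the standard framework for statistical machine translation decoding; as a sketch (with generic feature functions $h_m$ and weights $\lambda_m$, which are assumptions here, not notation taken from this paper), the decoder selects

```latex
\hat{e} = \operatorname*{argmax}_{e} \; \sum_{m=1}^{M} \lambda_m \, h_m(f, e)
```

where $f$ is the source sentence, $e$ a candidate translation, and the lexicon model scores enter as additional feature functions $h_m$ alongside the usual translation and language model features.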
Novel contributions of this work are phrase table smoothing with triplet lexicon models and with discriminative word lexicons. We also propose a new regularization technique for IBM model 1 that uses the Kullback-Leibler divergence from the empirical unigram distribution as the regularization term.
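One plausible form of the regularized IBM model 1 training criterion described above (the symbols $\alpha$, $\tilde{p}$, and $L$ are illustrative assumptions; the exact formulation is given in the paper itself) is

```latex
\hat{p} = \operatorname*{argmax}_{p} \;
  \Big[ \log L(p)
  \;-\; \alpha \sum_{f} \tilde{p}(f) \, \log \frac{\tilde{p}(f)}{p(f)} \Big]
```

where $\log L(p)$ is the usual IBM model 1 log-likelihood, $\tilde{p}$ is the empirical unigram distribution estimated from the training data, and $\alpha \geq 0$ trades off likelihood against the Kullback-Leibler regularization term.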
Experiments are carried out on the large-scale NIST Chinese-to-English translation task and on the English-to-French and Arabic-to-English IWSLT TED tasks. For Chinese-to-English and English-to-French, we obtain the best results by using the discriminative word lexicon to smooth our phrase tables.
Bibliographic reference. Huck, Matthias / Mansour, Saab / Wiesler, Simon / Ney, Hermann (2011): "Lexicon models for hierarchical phrase-based machine translation", In IWSLT-2011, 191-198.