N-gram-based Statistical Machine Translation relies on a standard n-gram language model over bilingual tuples to model the translation process. Training this translation model requires segmenting each parallel sentence into tuples, which forces a hard segmentation decision whenever a word is left unlinked by word alignment. The situation is especially critical when the unlinked word belongs to the target language, since in that case the hard decision cannot be avoided.
In this paper we present a thorough study of this situation, comparing for the first time each of the previously proposed techniques on two independent tasks, namely the English-Spanish European Parliament Proceedings large-vocabulary task and the Arabic-English Basic Travel Expressions small-data task. In light of this comparison, we present a novel segmentation technique which incorporates linguistic information. Results obtained in both tasks outperform all previous techniques.
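To make the hard decision concrete, the following is a minimal illustrative sketch (in Python) of tuple extraction from a word-aligned sentence pair. It is not the implementation used in the paper: the function name, the cut-point formulation, and the attach-to-previous / attach-to-next heuristics for unlinked target words are assumptions chosen only to show why an unlinked target word forces a segmentation choice.

def tuple_segmentation(src, tgt, links, attach="previous"):
    """Segment (src, tgt) into a monotonic sequence of bilingual tuples.

    links is a set of (i, j) pairs meaning src[i] is aligned to tgt[j].
    Unlinked target words are forced into the previous (or next) tuple,
    since a tuple with an empty source side could never be produced by a
    source-driven decoder.
    """
    links = set(links)
    aligned_tgt = {b for _, b in links}
    # 1) Hard decision: link every unaligned target word to the source
    #    words of its nearest aligned target neighbour (illustrative heuristic).
    for j in range(len(tgt)):
        if j in aligned_tgt or not aligned_tgt:
            continue
        left = [b for b in aligned_tgt if b < j]
        right = [b for b in aligned_tgt if b > j]
        if attach == "previous":
            anchor = max(left) if left else min(right)
        else:
            anchor = min(right) if right else max(left)
        links |= {(a, j) for a, b in set(links) if b == anchor}
    # 2) A cut point (i, j) is valid iff no alignment link crosses it.
    cuts = sorted((i, j)
                  for i in range(len(src) + 1)
                  for j in range(len(tgt) + 1)
                  if all((a < i) == (b < j) for a, b in links))
    # 3) Tuples are the spans between consecutive cut points; unlinked
    #    source words naturally yield tuples with an empty target side.
    return [(src[i0:i1], tgt[j0:j1])
            for (i0, j0), (i1, j1) in zip(cuts, cuts[1:])]

if __name__ == "__main__":
    src = ["I", "want", "coffee"]       # source sentence
    tgt = ["quiero", "un", "café"]      # target sentence; "un" is unlinked
    links = {(0, 0), (1, 0), (2, 2)}
    print(tuple_segmentation(src, tgt, links, attach="previous"))
    # [(['I', 'want'], ['quiero', 'un']), (['coffee'], ['café'])]
    print(tuple_segmentation(src, tgt, links, attach="next"))
    # [(['I', 'want'], ['quiero']), (['coffee'], ['un', 'café'])]

The two calls show how the same alignment yields different tuple vocabularies depending on where the unlinked target word is attached; strategies of this kind are among those compared in the paper, which further proposes using linguistic information to guide the choice.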
Cite as: Gispert, A.d., Mariño, J.B. (2006) Linguistic tuple segmentation in n-gram-based statistical machine translation. Proc. Interspeech 2006, paper 1049-Tue2CaP.1, doi: 10.21437/Interspeech.2006-350
@inproceedings{gispert06_interspeech,
  author={Adrià de Gispert and José B. Mariño},
  title={{Linguistic tuple segmentation in n-gram-based statistical machine translation}},
  year=2006,
  booktitle={Proc. Interspeech 2006},
  pages={paper 1049-Tue2CaP.1},
  doi={10.21437/Interspeech.2006-350}
}