2nd Workshop on Spoken Language Technologies for Under-Resourced Languages

Universiti Sains Malaysia, Penang, Malaysia
May 3-5, 2010

Unsupervised SMT for a Low-Resourced Language Pair

Thi-Ngoc-Diep Do (1,2), Laurent Besacier (1), Eric Castelli (2)

(1) LIG Laboratory, CNRS/UMR-5217, Grenoble, France
(2) MICA Center – HUT, Hanoi, Vietnam

This paper presents an unsupervised method for extracting parallel sentence pairs from a comparable corpus. A translation system is used to mine the comparable corpus and extract the parallel sentence pairs. An iterative process is implemented not only to increase the number of extracted parallel sentence pairs but also to improve the quality of the translation system. A comparison between this unsupervised method and a semi-supervised method is also presented. The unsupervised extraction method was tested under difficult conditions: no parallel corpus was available, and the comparable corpus contained up to 50% non-parallel sentence pairs. Nevertheless, the results show that the unsupervised method is applicable when parallel data is lacking.
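
The iterative mining loop described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (train, translate, similarity, mine), the Dice-scored word lexicon standing in for an SMT system, the Jaccard overlap used as the filtering score, the 0.5 threshold, and the choice to bootstrap from the noisy candidate pairs are all assumptions made for illustration only.

```python
from collections import Counter, defaultdict


def train(pairs):
    """Toy stand-in for SMT training (assumption): build a word-to-word
    lexicon by picking, for each source word, the target word with the
    highest Dice co-occurrence score."""
    src_count, tgt_count = Counter(), Counter()
    cooc = defaultdict(Counter)
    for src, tgt in pairs:
        s_words, t_words = set(src.split()), set(tgt.split())
        src_count.update(s_words)
        tgt_count.update(t_words)
        for s in s_words:
            cooc[s].update(t_words)
    lexicon = {}
    for s, cands in cooc.items():
        # sorted() makes tie-breaking deterministic
        lexicon[s] = max(
            sorted(cands),
            key=lambda t: 2 * cands[t] / (src_count[s] + tgt_count[t]),
        )
    return lexicon


def translate(lexicon, sentence):
    """Word-by-word translation; unknown words are copied unchanged."""
    return [lexicon.get(w, w) for w in sentence.split()]


def similarity(hyp_words, reference):
    """Jaccard overlap between translated words and the target sentence
    (a hypothetical filtering score, chosen for simplicity)."""
    hyp, ref = set(hyp_words), set(reference.split())
    union = hyp | ref
    return len(hyp & ref) / len(union) if union else 0.0


def mine(candidates, threshold=0.5, max_iters=5):
    """Iteratively extract likely-parallel pairs from noisy candidates."""
    training = list(candidates)  # no parallel corpus: start from noisy pairs
    extracted = []
    for _ in range(max_iters):
        lexicon = train(training)
        new = [
            (s, t)
            for s, t in candidates
            if similarity(translate(lexicon, s), t) >= threshold
        ]
        if new == extracted:  # nothing changed: stop iterating
            break
        extracted = new
        training = extracted or list(candidates)  # retrain on extracted pairs
    return extracted
```

Each pass retrains the toy system on the pairs extracted so far and rescores every candidate, mirroring the idea that extraction and translation quality can improve together. A real system would replace train and translate with an actual SMT toolkit and a stronger evaluation metric.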

Index Terms: unsupervised method, parallel sentence pair extraction, comparable corpus.

Bibliographic reference. Do, Thi-Ngoc-Diep / Besacier, Laurent / Castelli, Eric (2010): "Unsupervised SMT for a low-resourced language pair", in SLTU-2010, 130-135.