International Workshop on Spoken Language Translation (IWSLT) 2010

Paris, France
December 2-3, 2010

Resources for Adding Semantics to Machine Translation

Jan Hajič

Charles University in Prague, Czech Republic

Current (Statistical) Machine Translation systems rarely go beyond morphology, lemmatization, phrases or syntax. One possible direction for research in the near future is to use semantics in one way or another, whether as semantic features or factors within the successful phrase-based or hierarchical systems, in hybrid systems, or otherwise. However, semantic features have to be learnt from annotated data, at least until unsupervised learning can replace all the expensive annotation projects. In the talk, I will present the basics of the family of Prague dependency treebanks (currently available for Czech, English and Arabic), which to various extents provide combined manual annotation of syntax and semantics. The annotation is based on the dependency framework, but it is general enough to be used in systems of all types, including the classical non-hierarchical SMT systems where only word-based features can be incorporated into the model. One of the available corpora is specifically aimed at machine translation: it is a parallel, fully manually annotated Czech-English corpus consisting of the Penn Treebank texts (with the original annotation preserved) and their professional translation into Czech. Specific resources aimed at spoken language analysis will also be presented, even though no parallel version exists yet. These are based on the "speech reconstruction" idea of Fred Jelinek and his students, which was incorporated into a dialog corpus of Czech and English subsequently developed at Charles University.


Bibliographic reference. Hajič, Jan (2010): "Resources for adding semantics to machine translation", in IWSLT-2010, 403.